date:20190619

This series fixes the fallback of the top-down mmap: in case of
failure, a bottom-up scheme can be tried as a last resort between
the top-down mmap base and the stack, hoping for a large unused stack
limit.

Lots of architectures and even mm code start this fallback
at TASK_UNMAPPED_BASE, which is useless since the top-down scheme
already failed on the whole address space: instead, simply use
mmap_base.

Along the way, this makes it possible to get rid of mmap_legacy_base and
mmap_compat_legacy_base from mm_struct.

Note that arm and mips already implement this behaviour. 
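
The fallback described above can be sketched in user space as follows. This
is only a toy model under stated assumptions, not the kernel code: the
toy_range type, the toy_* helpers and the NO_GAP marker are hypothetical
names, and the address space is reduced to a sorted array of busy ranges.
It shows why the bottom-up retry can start at mmap_base: everything below
it was already searched by the failed top-down pass.

```c
#include <stddef.h>

#define NO_GAP ((unsigned long)-1)	/* hypothetical "no room" marker */

/* Toy model: busy[] is sorted by address and holds non-overlapping
 * [start, end) ranges that are already mapped. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/* Highest free gap of len bytes inside [low, high), or NO_GAP. */
static unsigned long toy_top_down(const struct toy_range *busy, size_t n,
				  unsigned long low, unsigned long high,
				  unsigned long len)
{
	unsigned long top = high;
	size_t i;

	for (i = n; i-- > 0; ) {
		if (busy[i].start >= top)
			continue;	/* entirely above the window */
		if (busy[i].end <= low)
			break;		/* this and lower ranges are below it */
		if (busy[i].end < top && top - busy[i].end >= len)
			return top - len;
		top = busy[i].start;
		if (top <= low)
			return NO_GAP;
	}
	return (top > low && top - low >= len) ? top - len : NO_GAP;
}

/* Lowest free gap of len bytes inside [low, high), or NO_GAP. */
static unsigned long toy_bottom_up(const struct toy_range *busy, size_t n,
				   unsigned long low, unsigned long high,
				   unsigned long len)
{
	unsigned long bot = low;
	size_t i;

	for (i = 0; i < n; i++) {
		if (busy[i].end <= bot)
			continue;	/* entirely below the window */
		if (busy[i].start >= high)
			break;
		if (busy[i].start > bot && busy[i].start - bot >= len)
			return bot;
		bot = busy[i].end;
		if (bot >= high)
			return NO_GAP;
	}
	return (high > bot && high - bot >= len) ? bot : NO_GAP;
}

/* Top-down below mmap_base first; on failure, bottom-up from mmap_base. */
static unsigned long toy_get_unmapped_area(const struct toy_range *busy,
					   size_t n, unsigned long min_addr,
					   unsigned long mmap_base,
					   unsigned long task_size,
					   unsigned long len)
{
	unsigned long addr = toy_top_down(busy, n, min_addr, mmap_base, len);

	if (addr != NO_GAP)
		return addr;
	/* [min_addr, mmap_base) was just searched, so do not rescan it. */
	return toy_bottom_up(busy, n, mmap_base, task_size, len);
}
```

In this model the stack is simply one more busy range near task_size, so the
fallback succeeds only if the gap between mmap_base and the stack is large
enough.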

Alexandre Ghiti (8):
  s390: Start fallback of top-down mmap at mm->mmap_base
  sh: Start fallback of top-down mmap at mm->mmap_base
  sparc: Start fallback of top-down mmap at mm->mmap_base
  x86, hugetlbpage: Start fallback of top-down mmap at mm->mmap_base
  mm: Start fallback top-down mmap at mm->mmap_base
  parisc: Use mmap_base, not mmap_legacy_base, as low_limit for bottom-up mmap
  x86: Use mmap_*base, not mmap_*legacy_base, as low_limit for bottom-up mmap
  mm: Remove mmap_legacy_base and mmap_compat_legacy_code fields from mm_struct

 arch/parisc/kernel/sys_parisc.c  |  8 +++-
 arch/s390/mm/mmap.c  |  2 +-
 arch/sh/mm/mmap.c|  2 +-
 arch/sparc/kernel/sys_sparc_64.c |  2 +-
 arch/sparc/mm/hugetlbpage.c  |  2 +-
 arch/x86/include/asm/elf.h   |  2 +-
 arch/x86/kernel/sys_x86_64.c |  4 ++--
 arch/x86/mm/hugetlbpage.c|  7 ---
 arch/x86/mm/mmap.c   | 20 +---
 include/linux/mm_types.h |  2 --
 mm/debug.c   |  4 ++--
 mm/mmap.c|  2 +-
 12 files changed, 26 insertions(+), 31 deletions(-)

-- 
2.20.1

[PATCH] USB: core: correct a spelling mistake in the comment

2019-06-19 Thread Harry Pan

Fix a spelling typo in the function comment.

Signed-off-by: Harry Pan 

---

 drivers/usb/core/hub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 8d4631c81b9f..1988f8f88f75 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2719,7 +2719,7 @@ static bool use_new_scheme(struct usb_device *udev, int retry,
 }
 
 /* Is a USB 3.0 port in the Inactive or Compliance Mode state?
- * Port worm reset is required to recover
+ * Port warm reset is required to recover
  */
 static bool hub_port_warm_reset_required(struct usb_hub *hub, int port1,
u16 portstatus)
-- 
2.20.1

Re: linux-next: manual merge of the jc_docs tree with the char-misc.current tree

2019-06-19 Thread Greg KH

On Thu, Jun 20, 2019 at 11:11:28AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the jc_docs tree got a conflict in:
> 
>   Documentation/fb/fbcon.rst
> 
> between commit:
> 
>   fce677d7e8f0 ("docs: fb: Add TER16x32 to the available font names")
> 
> from the char-misc.current tree and commit:
> 
>   ab42b818954c ("docs: fb: convert docs to ReST and rename to *.rst")
> 
> from the jc_docs tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc Documentation/fb/fbcon.rst
> index 5a865437b33f,cfb9f7c38f18..
> --- a/Documentation/fb/fbcon.rst
> +++ b/Documentation/fb/fbcon.rst
> @@@ -77,12 -80,12 +80,12 @@@ C. Boot option
>   
>   1. fbcon=font:
>   
> - Select the initial font to use. The value 'name' can be any of the
> - compiled-in fonts: 10x18, 6x10, 7x14, Acorn8x8, MINI4x6,
> - PEARL8x8, ProFont6x11, SUN12x22, SUN8x16, TER16x32, VGA8x16, VGA8x8.
> + Select the initial font to use. The value 'name' can be any of the
> + compiled-in fonts: 10x18, 6x10, 7x14, Acorn8x8, MINI4x6,
>  -PEARL8x8, ProFont6x11, SUN12x22, SUN8x16, VGA8x16, VGA8x8.
> ++PEARL8x8, ProFont6x11, SUN12x22, SUN8x16, TER16x32, VGA8x16, VGA8x8.
>   
>   Note, not all drivers can handle font with widths not divisible by 8,
> - such as vga16fb.
> + such as vga16fb.
>   
>   2. fbcon=scrollback:[k]
>   

Fix looks good to me, thanks!

greg k-h

RE: [PATCH v2 2/5] net: macb: add support for sgmii MAC-PHY interface

2019-06-19 Thread Parshuram Raju Thombare

>From: Russell King - ARM Linux admin 
>
>On Wed, Jun 19, 2019 at 11:23:01AM +, Parshuram Raju Thombare wrote:
>
>> >From: Russell King - ARM Linux admin 
>
>> >
>
>> >On Wed, Jun 19, 2019 at 09:40:46AM +0100, Parshuram Thombare wrote:
>
>> >
>
>> >> This patch add support for SGMII interface) and
>
>> >
>
>> >> 2.5Gbps MAC in Cadence ethernet controller driver.
>
>>
>
>> >>   switch (state->interface) {
>
>> >
>
>> >> + case PHY_INTERFACE_MODE_SGMII:
>
>> >
>
>> >> + if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE)
>
>> >
>
>> >> + phylink_set(mask, 2500baseT_Full);
>
>> >
>
>> >
>
>> >
>
>> >This doesn't look correct to me.  SGMII as defined by Cisco only
>
>> >supports 1G, 100M and 10M speeds, not 2.5G.
>
>>
>
>> Cadence MAC support 2.5G SGMII by using higher clock frequency.
>
>
>
>Ok, so why not set 2.5GBASE-X too?  Does the MAC handle auto-detecting
>
>the SGMII/BASE-X speed itself or does it need to be programmed?  If it
>
>needs to be programmed, you need additional handling in the validate
>
>callback to deal with that.

No, currently the MAC can't auto-detect it; it needs to be programmed.
But I think programming the speed/duplex mode is already done for non-in-band
modes in mac_config.
For in-band mode, I see two places to configure the MAC speed
and duplex mode: 1. mac_link_state 2. mac_link_up. In mac_link_up, though the
state read from mac_link_state is passed, it is only used for printing the log
and updating pl->cur_interface, so if configuring the MAC speed/duplex mode in
mac_link_up is correct, these parameters will need to be read again from HW.

>> >> + case PHY_INTERFACE_MODE_2500BASEX:
>
>> >
>
>> >> + if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE)
>
>> >
>
>> >> + phylink_set(mask, 2500baseX_Full);
>
>> >
>
>> >> + /* fallthrough */
>
>> >
>
>> >> + case PHY_INTERFACE_MODE_1000BASEX:
>
>> >
>
>> >> + if (bp->caps & MACB_CAPS_GIGABIT_MODE_AVAILABLE)
>
>> >
>
>> >> + phylink_set(mask, 1000baseX_Full);
>
>> >
>
>> >> + break;
>
>> >
>
>> >
>
>> >
>
>> >Please see how other drivers which use phylink deal with the validate()
>
>> >format, and please read the phylink documentation:
>
>> >
>
>> > * Note that the PHY may be able to transform from one connection
>
>> > * technology to another, so, eg, don't clear 1000BaseX just
>
>> > * because the MAC is unable to BaseX mode. This is more about
>
>> > * clearing unsupported speeds and duplex settings.
>
>> >
>
>>
>
>> There are some configs used in this driver which limits MAC speed.
>
>> Above checks just to make sure this use case does not break.
>
>
>
>That's not what I'm saying.
>
>
>
>By way of example, you're offering 1000BASE-T just because the MAC
>
>connection supports it.  However, the MAC doesn't _actually_ support
>
>1000BASE-T, it supports a connection to a PHY that _happens_ to
>
>convert the MAC connection to 1000BASE-T.  It could equally well
>
>convert the MAC connection to 1000BASE-X.
>
>
>
>So, only setting 1000BASE-X when you have a PHY connection using
>
>1000BASE-X is fundamentally incorrect.
>
>
>
>For example, you could have a MAC <-> PHY link using standard 1.25Gbps
>   
>SGMII, and the PHY offers 1000BASE-T _and_ 1000BASE-X connections on
>
>a first-link-up basis.  An example of a PHY that does this are the
>
>Marvell 1G PHYs (eg, 88E151x).
>
>
>
>This point is detailed in the PHYLINK documentation, which I quoted
>
>above.
Ok, I will not clear 1000/2500BASE-T when the PHY connection is just
1000/2500BASE-X.
I will also keep the 1000/2500BASE-X link modes for SGMII/GMII modes.
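
A rough sketch of the agreed direction, as a self-contained toy rather than
the real phylink API: the TOY_* bits, toy_interface and toy_validate() are
hypothetical names, and the assumption that only the 1000BASE-X interface
caps the speed at 1G is mine, not something confirmed in this thread. The
point it illustrates is that the BASE-T/BASE-X media family is never
filtered by the MAC-PHY interface mode, only the speed is.

```c
/* Hypothetical link-mode bits; the real driver calls phylink_set() on an
 * ethtool link-mode bitmap instead. */
enum {
	TOY_1000BASET_FULL = 1 << 0,
	TOY_1000BASEX_FULL = 1 << 1,
	TOY_2500BASET_FULL = 1 << 2,
	TOY_2500BASEX_FULL = 1 << 3,
};

/* Hypothetical subset of MAC-PHY interface modes. */
enum toy_interface {
	TOY_IF_SGMII,
	TOY_IF_1000BASEX,
	TOY_IF_2500BASEX,
};

/*
 * Return the gigabit-and-above link modes to advertise. Slower modes are
 * left out of this toy for brevity.
 */
static unsigned int toy_validate(enum toy_interface iface, int gigabit_capable)
{
	unsigned int mask = 0;

	if (!gigabit_capable)
		return 0;

	/* The PHY may convert to either medium, so offer both families. */
	mask |= TOY_1000BASET_FULL | TOY_1000BASEX_FULL;

	/* Assumption: SGMII and 2500BASE-X can run at the higher clock. */
	if (iface != TOY_IF_1000BASEX)
		mask |= TOY_2500BASET_FULL | TOY_2500BASEX_FULL;

	return mask;
}
```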

>
>
>> >> @@ -506,18 +563,26 @@ static void gem_mac_config(struct phylink_config
>
>> >*pl_config, unsigned int mode,
>
>> >>   switch (state->speed) {
>
>> >> + case SPEED_2500:
>
>> >> + gem_writel(bp, NCFGR, GEM_BIT(GBE) |
>
>> >> +gem_readl(bp, NCFGR));
>
>> >>   }
>
>> >> - macb_or_gem_writel(bp, NCFGR, reg);
>
>> >>
>
>> >>   bp->speed = state->speed;
>
>> >>   bp->duplex = state->duplex;
>
>> >
>
>> >
>
>> >
>
>> >This is not going to work for 802.3z nor SGMII properly when in-band
>
>> >negotiation is used.  We don't know ahead of time what the speed and
>
>> >duplex will be.  Please see existing drivers for examples showing
>
>> >how mac_config() should be implemented (there's good reason why its
>
>> >laid out as it is in those drivers.)
>
>> >
>
>> Ok, Here I will configure MAC only for FIXED and PHY mode.
>
>
>
>As you are not the only one who has made this error, I'm considering
>
>splitting mac_config() into mac_config_fixed() and mac_config_inband()
>
>so that it's clearer what is required.  Maybe even taking separate
>
>structures so that it's impossible to access members that should not
>
>be used.
>
For in band mode, I see two places to config MAC speed
and duplex mode - 1. mac_link_state 2. mac_link_up. 
In mac_link_up, though state read from mac_link_state is passed, 
it is only

[PATCH] hung_task: recover hung task warnings in next check interval

2019-06-19 Thread Yafang Shao

When sysctl_hung_task_warnings reaches 0, hung task messages will not
be reported any more.

If the user wants to get more hung task messages, he must reset
kernel.hung_task_warnings to a positive integer or -1 with sysctl.
This is inconvenient for the user.
It is better to reset the hung task warnings in the kernel, so that the
user doesn't need to pay attention to this value.

With this patch, hung task warnings will be reset to the
sysctl_hung_task_warnings setting in every check interval.

Another difference is that if the user sets kernel.hung_task_warnings to a
new value, the new value will take effect in the next check interval.
For example, while the kernel is printing hung task messages, the
user can't set it to 0 to stop the printing, but I don't think this will
happen in the real world. (If that happens, then sysctl_hung_task_warnings
must be protected by a lock.)
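
The per-interval budget described above can be sketched in user space as
follows. This is a simplified model, not the kernel code: run_check_interval()
and report_hung_task() are hypothetical helpers, and the global only stands in
for the kernel.hung_task_warnings sysctl. It shows that the decrement happens
on a local snapshot, so the configured value is untouched and the budget is
available again in the next interval.

```c
/* Stand-in for the kernel.hung_task_warnings sysctl value. */
static int sysctl_hung_task_warnings = 10;

/* Consume one warning from the budget; returns 1 if a report is emitted. */
static int report_hung_task(int *warnings)
{
	if (!*warnings)
		return 0;		/* budget exhausted for this interval */
	if (*warnings > 0)
		(*warnings)--;		/* -1 means unlimited, never decremented */
	return 1;
}

/* One check interval: snapshot the sysctl, then spend only the snapshot. */
static int run_check_interval(int hung_tasks)
{
	int warnings = sysctl_hung_task_warnings;
	int reported = 0;
	int i;

	for (i = 0; i < hung_tasks; i++)
		reported += report_hung_task(&warnings);

	return reported;
}
```

With the sysctl set to 2 and five hung tasks, each interval reports two
tasks, and a value written by the user in the middle of an interval is only
picked up by the next snapshot.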

Signed-off-by: Yafang Shao 
---
 Documentation/sysctl/kernel.txt |  5 -
 kernel/hung_task.c  | 19 ---
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index f0c86fb..350df41 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -377,6 +377,8 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 0 (default): means use hung_task_timeout_secs as checking interval.
 Possible values to set are in range {0..LONG_MAX/HZ}.
+hung_task_check_interval_secs must not be set greater than
+hung_task_timeout_secs.
 
 ==
 
@@ -384,7 +386,8 @@ hung_task_warnings:
 
 The maximum number of warnings to report. During a check interval
 if a hung task is detected, this value is decreased by 1.
-When this value reaches 0, no more warnings will be reported.
+When this value reaches 0, no more warnings will be reported until
+next check interval begins.
 This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
 
 -1: report an infinite number of warnings.
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 14a625c..01e6c94 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -85,7 +85,8 @@ static int __init hung_task_panic_setup(char *str)
.notifier_call = hung_task_panic,
 };
 
-static void check_hung_task(struct task_struct *t, unsigned long timeout)
+static void check_hung_task(struct task_struct *t, unsigned long timeout,
+   int *warnings)
 {
unsigned long switch_count = t->nvcsw + t->nivcsw;
 
@@ -124,9 +125,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 * Ok, the task did not get scheduled for more than 2 minutes,
 * complain:
 */
-   if (sysctl_hung_task_warnings) {
-   if (sysctl_hung_task_warnings > 0)
-   sysctl_hung_task_warnings--;
+   if (*warnings) {
+   if (*warnings > 0)
+   (*warnings)--;
pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
   t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
pr_err("  %s %s %.*s\n",
@@ -170,7 +171,8 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
  * a really long time (120 seconds). If that happens, print out
  * a warning.
  */
-static void check_hung_uninterruptible_tasks(unsigned long timeout)
+static void check_hung_uninterruptible_tasks(unsigned long timeout,
+int *warnings)
 {
int max_count = sysctl_hung_task_check_count;
unsigned long last_break = jiffies;
@@ -195,7 +197,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
}
/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
if (t->state == TASK_UNINTERRUPTIBLE)
-   check_hung_task(t, timeout);
+   check_hung_task(t, timeout, warnings);
}
  unlock:
rcu_read_unlock();
@@ -271,6 +273,7 @@ static int hungtask_pm_notify(struct notifier_block *self,
 static int watchdog(void *dummy)
 {
unsigned long hung_last_checked = jiffies;
+   int warnings;
 
set_user_nice(current, 0);
 
@@ -284,9 +287,11 @@ static int watchdog(void *dummy)
interval = min_t(unsigned long, interval, timeout);
t = hung_timeout_jiffies(hung_last_checked, interval);
if (t <= 0) {
+   warnings = sysctl_hung_task_warnings;
if (!atomic_xchg(&reset_hung_task, 0) &&
!hung_detector_suspended)
-   check_hung_uninterruptible_tasks(timeout);
+   check_hung_uninterruptible_tasks(timeout,
+   &warnings);
hung_last_checked = jiffies;
continue;

Re: [RFC 0/2] Support for buttons on newer MS Surface devices

2019-06-19 Thread Andy Shevchenko

On Wed, Jun 12, 2019 at 2:06 AM Maximilian Luz  wrote:
>
> Since there are no comments on this, should I simply submit this as patch?

No top post, please.

And yes, submit it as a series. Also Cc to Benjamin Tissoires.

> On 6/1/19 9:07 PM, Maximilian Luz wrote:
> > Hi,
> >
> > any comments on this?
> >
> > I should also mention that this has been tested via
> > https://github.com/jakeday/linux-surface.
> >
> > Maximilian



-- 
With Best Regards,
Andy Shevchenko

Re: [PATCH] slub: Don't panic for memcg kmem cache creation failure

2019-06-19 Thread Michal Hocko

On Wed 19-06-19 16:25:14, Shakeel Butt wrote:
> Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and
> the corresponding root kmem cache has SLAB_PANIC flag, the kernel will
> be crashed. This is unnecessary as the kernel can handle the creation
> failures of memcg kmem caches.

AFAICS it will handle those by simply not accounting those objects
right?

> Additionally CONFIG_SLAB does not
> implement this behavior. So, to keep the behavior consistent between
> SLAB and SLUB, removing the panic for memcg kmem cache creation
> failures. The root kmem cache creation failure for SLAB_PANIC correctly
> panics for both SLAB and SLUB.

I do agree that panicking is really dubious, especially because it opens
doors to shut the system down from a restricted environment. So the
patch makes sense to me.

I am wondering whether SLAB_PANIC makes sense in general though. Why is
it any different from any other essential early allocations? We tend to
not care about allocation failures for those on the basis that the system
must be in a broken state to fail that early already. Do you think it is
time to remove SLAB_PANIC altogether?

> Reported-by: Dave Hansen 
> Signed-off-by: Shakeel Butt 

Acked-by: Michal Hocko 

> ---
>  mm/slub.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 6a5174b51cd6..84c6508e360d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3640,10 +3640,6 @@ static int kmem_cache_open(struct kmem_cache *s, 
> slab_flags_t flags)
>  
>   free_kmem_cache_nodes(s);
>  error:
> - if (flags & SLAB_PANIC)
> - panic("Cannot create slab %s size=%u realsize=%u order=%u 
> offset=%u flags=%lx\n",
> -   s->name, s->size, s->size,
> -   oo_order(s->oo), s->offset, (unsigned long)flags);
>   return -EINVAL;
>  }
>  
> -- 
> 2.22.0.410.gd8fdbe21b5-goog

-- 
Michal Hocko
SUSE Labs

linux-next: manual merge of the char-misc tree with the driver-core tree

Hi all,

Today's linux-next merge of the char-misc tree got a conflict in:

  drivers/misc/mei/debugfs.c

between commit:

  5666d896e838 ("mei: no need to check return value of debugfs_create functions")

from the driver-core tree and commit:

  b728ddde769c ("mei: Convert to use DEFINE_SHOW_ATTRIBUTE macro")

from the char-misc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/misc/mei/debugfs.c
index df6bf8b81936,47cfd5005e1b..
--- a/drivers/misc/mei/debugfs.c
+++ b/drivers/misc/mei/debugfs.c
@@@ -233,22 -154,46 +154,21 @@@ void mei_dbgfs_deregister(struct mei_de
   *
   * @dev: the mei device structure
   * @name: the mei device name
 - *
 - * Return: 0 on success, <0 on failure.
   */
 -int mei_dbgfs_register(struct mei_device *dev, const char *name)
 +void mei_dbgfs_register(struct mei_device *dev, const char *name)
  {
 -  struct dentry *dir, *f;
 +  struct dentry *dir;
  
dir = debugfs_create_dir(name, NULL);
 -  if (!dir)
 -  return -ENOMEM;
 -
dev->dbgfs_dir = dir;
  
 -  f = debugfs_create_file("meclients", S_IRUSR, dir,
 -  dev, &mei_dbgfs_meclients_fops);
 -  if (!f) {
 -  dev_err(dev->dev, "meclients: registration failed\n");
 -  goto err;
 -  }
 -  f = debugfs_create_file("active", S_IRUSR, dir,
 -  dev, &mei_dbgfs_active_fops);
 -  if (!f) {
 -  dev_err(dev->dev, "active: registration failed\n");
 -  goto err;
 -  }
 -  f = debugfs_create_file("devstate", S_IRUSR, dir,
 -  dev, &mei_dbgfs_devstate_fops);
 -  if (!f) {
 -  dev_err(dev->dev, "devstate: registration failed\n");
 -  goto err;
 -  }
 -  f = debugfs_create_file("allow_fixed_address", S_IRUSR | S_IWUSR, dir,
 -  &dev->allow_fixed_address,
 -  &mei_dbgfs_allow_fa_fops);
 -  if (!f) {
 -  dev_err(dev->dev, "allow_fixed_address: registration failed\n");
 -  goto err;
 -  }
 -  return 0;
 -err:
 -  mei_dbgfs_deregister(dev);
 -  return -ENODEV;
 +  debugfs_create_file("meclients", S_IRUSR, dir, dev,
-   &mei_dbgfs_fops_meclients);
++  &mei_dbgfs_meclients_fops);
 +  debugfs_create_file("active", S_IRUSR, dir, dev,
-   &mei_dbgfs_fops_active);
++  &mei_dbgfs_active_fops);
 +  debugfs_create_file("devstate", S_IRUSR, dir, dev,
-   &mei_dbgfs_fops_devstate);
++  &mei_dbgfs_devstate_fops);
 +  debugfs_create_file("allow_fixed_address", S_IRUSR | S_IWUSR, dir,
 +  &dev->allow_fixed_address,
-   &mei_dbgfs_fops_allow_fa);
++  &mei_dbgfs_allow_fa_fops);
  }
- 



[PATCH 1/1] staging: media: fix style problem

2019-06-19 Thread Aliasgar Surti

From: Aliasgar Surti 

checkpatch reported "WARNING: line over 80 characters".
This patch fixes the warning for file davinci_vpfe/dm365_isif.c

Signed-off-by: Aliasgar Surti 
---
 drivers/staging/media/davinci_vpfe/dm365_isif.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/media/davinci_vpfe/dm365_isif.c b/drivers/staging/media/davinci_vpfe/dm365_isif.c
index 46fd818..e9c8de1 100644
--- a/drivers/staging/media/davinci_vpfe/dm365_isif.c
+++ b/drivers/staging/media/davinci_vpfe/dm365_isif.c
@@ -532,7 +532,8 @@ static int isif_validate_dfc_params(const struct vpfe_isif_dfc *dfc)
 #define DM365_ISIF_MAX_CLVSV   0x1fff
 #define DM365_ISIF_MAX_HEIGHT_BLACK_REGION 0x1fff
 
-static int isif_validate_bclamp_params(const struct vpfe_isif_black_clamp *bclamp)
+static int
+isif_validate_bclamp_params(const struct vpfe_isif_black_clamp *bclamp)
 {
int err = -EINVAL;
 
@@ -593,7 +594,8 @@ isif_validate_raw_params(const struct vpfe_isif_raw_config *params)
return isif_validate_bclamp_params(&params->bclamp);
 }
 
-static int isif_set_params(struct v4l2_subdev *sd, const struct vpfe_isif_raw_config *params)
+static int isif_set_params(struct v4l2_subdev *sd,
+  const struct vpfe_isif_raw_config *params)
 {
struct vpfe_isif_device *isif = v4l2_get_subdevdata(sd);
int ret = -EINVAL;
-- 
2.7.4

selftests: bpf: test_align Test 4 unknown shift Failed to find match 7 R0=pkt(id=0,off=8,r=8,imm=0)

2019-06-19 Thread Naresh Kamboju

selftests: bpf: test_align failed running Linux -next kernel
5.2.0-rc5-next-20190619.

Here is the log from x86_64,

# selftests bpf test_align
bpf: test_align_ #
# Test   0 mov ... PASS
0: mov_... #
# Test   1 shift ... PASS
1: shift_... #
# Test   2 addsub ... PASS
2: addsub_... #
# Test   3 mul ... PASS
3: mul_... #
# Test   4 unknown shift ... Failed to find match 7 R0=pkt(id=0,off=8,r=8,imm=0)
4: unknown_shift #
# func#0 @0
@0: _ #
# 0 R1=ctx(id=0,off=0,imm=0) R10=fp0
R1=ctx(id=0,off=0,imm=0): R10=fp0_ #
# 0 (61) r2 = *(u32 *)(r1 +76)
(61): r2_= #
# 1 R1=ctx(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0) R10=fp0
R1=ctx(id=0,off=0,imm=0): R2_w=pkt(id=0,off=0,r=0,imm=0)_R10=fp0 #
# 1 (61) r3 = *(u32 *)(r1 +80)
(61): r3_= #
# 2 R1=ctx(id=0,off=0,imm=0) R2_w=pkt(id=0,off=0,r=0,imm=0)
R3_w=pkt_end(id=0,off=0,imm=0) R10=fp0
R1=ctx(id=0,off=0,imm=0):
R2_w=pkt(id=0,off=0,r=0,imm=0)_R3_w=pkt_end(id=0,off=0,imm=0) #
# 2 (bf) r0 = r2
(bf): r0_= #
...
# processed 22 insns (limit 100) max_states_per_insn 0
total_states 1 peak_states 1 mark_read 1
22: insns_(limit #
# FAIL
: _ #
# Results 6 pass 6 fail
6: pass_6 #
[FAIL] 7 selftests bpf test_align
selftests: bpf_test_align [FAIL]

Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20190619/testrun/781777/log

Test results comparison,
https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_align

Good linux -next tag: next-20190618
Bad linux -next tag: next-20190619
git branch: master
git commit: c0e4c41afeef66d21dc5704f614624cecac806ac
git describe: next-20190618
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

Best regards
Naresh Kamboju

Re: [PATCH v2] ocxl: Allow contexts to be attached with a NULL mm

2019-06-19 Thread Nicholas Piggin

Alastair D'Silva's on June 20, 2019 2:12 pm:
> From: Alastair D'Silva 
> 
> If an OpenCAPI context is to be used directly by a kernel driver, there
> may not be a suitable mm to use.
> 
> The patch makes the mm parameter to ocxl_context_attach optional.
> 
> Signed-off-by: Alastair D'Silva 

Yeah I don't think you need to manage a kernel context explicitly
because it will always be flushed with tlbie, comment helps. For
the powerpc/mm bit,

Acked-by: Nicholas Piggin

[PATCH v2 4/5] nvme: move common definitions to pci.h

From: Dan Williams 

A platform-driver for nvme resources needs access to struct nvme_dev and
other definitions that are currently local to pci.c.

Signed-off-by: Dan Williams 
Signed-off-by: Daniel Drake 
---
 drivers/nvme/host/pci.c | 125 +---
 drivers/nvme/host/pci.h | 136 
 2 files changed, 137 insertions(+), 124 deletions(-)
 create mode 100644 drivers/nvme/host/pci.h

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 23bda524f16b..bed6c91b6b7c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -26,6 +26,7 @@
 
 #include "trace.h"
 #include "nvme.h"
+#include "pci.h"
 
 #define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
@@ -83,97 +84,9 @@ static int poll_queues = 0;
 module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
 MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
 
-struct nvme_dev;
-struct nvme_queue;
-
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
-struct nvme_dev_ops {
-   /* Enable device (required) */
-   int (*enable)(struct nvme_dev *dev);
-
-   /* Disable device (required) */
-   void (*disable)(struct nvme_dev *dev);
-
-   /* Allocate IRQ vectors for given number of io queues (required) */
-   int (*setup_irqs)(struct nvme_dev *dev, int nr_io_queues);
-
-   /* Get the IRQ vector for a specific queue */
-   int (*q_irq)(struct nvme_queue *q);
-
-   /* Allocate device-specific SQ command buffer (optional) */
-   int (*cmb_alloc_sq_cmds)(struct nvme_queue *nvmeq, size_t size,
-struct nvme_command **sq_cmds,
-dma_addr_t *sq_dma_addr);
-
-   /* Free device-specific SQ command buffer (optional) */
-   void (*cmb_free_sq_cmds)(struct nvme_queue *nvmeq,
-struct nvme_command *sq_cmds, size_t size);
-
-   /* Device-specific mapping of blk queues to CPUs (optional) */
-   int (*map_queues)(struct nvme_dev *dev, struct blk_mq_queue_map *map,
- int offset);
-
-   /* Check if device is enabled on the bus (required) */
-   int (*is_enabled)(struct nvme_dev *dev);
-
-   /* Check if channel is in running state (required) */
-   int (*is_offline)(struct nvme_dev *dev);
-
-   /* Check if device is present and responding (optional) */
-   bool (*is_present)(struct nvme_dev *dev);
-
-   /* Check & log device state before it gets reset (optional) */
-   void (*warn_reset)(struct nvme_dev *dev);
-};
-
-/*
- * Represents an NVM Express device.  Each nvme_dev is a PCI function.
- */
-struct nvme_dev {
-   const struct resource *res;
-   const struct nvme_dev_ops *ops;
-   struct nvme_queue *queues;
-   struct blk_mq_tag_set tagset;
-   struct blk_mq_tag_set admin_tagset;
-   u32 __iomem *dbs;
-   struct device *dev;
-   struct dma_pool *prp_page_pool;
-   struct dma_pool *prp_small_pool;
-   unsigned online_queues;
-   unsigned max_qid;
-   unsigned io_queues[HCTX_MAX_TYPES];
-   unsigned int num_vecs;
-   int q_depth;
-   u32 db_stride;
-   void __iomem *bar;
-   unsigned long bar_mapped_size;
-   struct work_struct remove_work;
-   struct mutex shutdown_lock;
-   bool subsystem;
-   u64 cmb_size;
-   bool cmb_use_sqes;
-   u32 cmbsz;
-   u32 cmbloc;
-   struct nvme_ctrl ctrl;
-
-   mempool_t *iod_mempool;
-
-   /* shadow doorbell buffer support: */
-   u32 *dbbuf_dbs;
-   dma_addr_t dbbuf_dbs_dma_addr;
-   u32 *dbbuf_eis;
-   dma_addr_t dbbuf_eis_dma_addr;
-
-   /* host memory buffer support: */
-   u64 host_mem_size;
-   u32 nr_host_mem_descs;
-   dma_addr_t host_mem_descs_dma;
-   struct nvme_host_mem_buf_desc *host_mem_descs;
-   void **host_mem_desc_bufs;
-};
-
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 {
int n = 0, ret;
@@ -213,42 +126,6 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
return container_of(ctrl, struct nvme_dev, ctrl);
 }
 
-/*
- * An NVM Express queue.  Each device has at least two (one for admin
- * commands and one for I/O commands).
- */
-struct nvme_queue {
-   struct nvme_dev *dev;
-   char irqname[24];   /* nvme4294967295-65535\0 */
-   spinlock_t sq_lock;
-   struct nvme_command *sq_cmds;
-/* only used for poll queues: */
-   spinlock_t cq_poll_lock cacheline_aligned_in_smp;
-   volatile struct nvme_completion *cqes;
-   struct blk_mq_tags **tags;
-   dma_addr_t sq_dma_addr;
-   dma_addr_t cq_dma_addr;
-   u32 __iomem *q_db;
-   u16 q_depth;
-   u16 cq_vector;
-   u16

[PATCH v2 5/5] nvme: Intel AHCI remap support

Provide a platform driver for the nvme resources that may be remapped
behind an ahci bar on common Intel platforms. The implementation relies
on the standard nvme driver, but reimplements the nvme_dev_ops accordingly.

As the original NVMe PCI device is inaccessible, this driver is somewhat
limited: we always assume the device is present & online, can't
detect PCI errors, can't reset, power management is limited, etc.

A single shared legacy interrupt is used, although there is some
hope that MSI-X support could be added later.

Based on previous code by Dan Williams.
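
The idea of one core driver with per-backend operations can be sketched as
follows. This is a reduced, hypothetical model and not the driver code: the
toy_dev, toy_dev_ops and toy_* names are made up for illustration, and
treating a missing optional callback as "device present" is an assumption
that mirrors the "always assume the device is present & online" limitation
above.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_dev;

/* Reduced stand-in for an ops table with required and optional hooks. */
struct toy_dev_ops {
	int (*enable)(struct toy_dev *dev);		/* required */
	bool (*is_present)(struct toy_dev *dev);	/* optional, may be NULL */
};

struct toy_dev {
	const struct toy_dev_ops *ops;
	int enabled;		/* how many times enable() succeeded */
	bool responding;	/* what a bus-level probe would report */
};

static int toy_enable(struct toy_dev *dev)
{
	dev->enabled++;
	return 0;
}

/* A PCI-like backend can really ask the bus whether the device answers. */
static bool toy_pci_is_present(struct toy_dev *dev)
{
	return dev->responding;
}

static const struct toy_dev_ops toy_pci_ops = {
	.enable		= toy_enable,
	.is_present	= toy_pci_is_present,
};

/* The remapped backend has no way to check, so it leaves the hook unset. */
static const struct toy_dev_ops toy_remap_ops = {
	.enable		= toy_enable,
};

/* Core helper: fall back to "present" when the backend cannot tell. */
static bool toy_dev_is_present(struct toy_dev *dev)
{
	if (!dev->ops->is_present)
		return true;
	return dev->ops->is_present(dev);
}
```

Keeping the fallback in one core helper means the shared code never has to
know which backend it is talking to.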

Signed-off-by: Daniel Drake 
---
 drivers/ata/Kconfig  |   1 +
 drivers/nvme/host/Kconfig|   3 +
 drivers/nvme/host/Makefile   |   3 +
 drivers/nvme/host/intel-ahci-remap.c | 185 +++
 drivers/nvme/host/pci.c  |  21 +--
 drivers/nvme/host/pci.h  |   9 ++
 6 files changed, 214 insertions(+), 8 deletions(-)
 create mode 100644 drivers/nvme/host/intel-ahci-remap.c

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index 6e82d66d7516..fb64e690d325 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -113,6 +113,7 @@ config SATA_AHCI_INTEL_NVME_REMAP
bool "AHCI: Intel Remapped NVMe device support"
depends on SATA_AHCI
depends on BLK_DEV_NVME
+   select NVME_INTEL_AHCI_REMAP
help
  Support access to remapped NVMe devices that appear in AHCI PCI
  memory space.
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index ec43ac9199e2..a8aefb18eb15 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -26,6 +26,9 @@ config NVME_MULTIPATH
 config NVME_FABRICS
tristate
 
+config NVME_INTEL_AHCI_REMAP
+   tristate
+
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 8a4b671c5f0c..2010169880b7 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_FABRICS)  += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA)+= nvme-rdma.o
 obj-$(CONFIG_NVME_FC)  += nvme-fc.o
 obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
+obj-$(CONFIG_NVME_INTEL_AHCI_REMAP)+= nvme-intel-ahci-remap.o
 
 nvme-core-y:= core.o
 nvme-core-$(CONFIG_TRACING)+= trace.o
@@ -24,3 +25,5 @@ nvme-rdma-y   += rdma.o
 nvme-fc-y  += fc.o
 
 nvme-tcp-y += tcp.o
+
+nvme-intel-ahci-remap-y+= intel-ahci-remap.o
diff --git a/drivers/nvme/host/intel-ahci-remap.c b/drivers/nvme/host/intel-ahci-remap.c
new file mode 100644
index ..7194d9dd0016
--- /dev/null
+++ b/drivers/nvme/host/intel-ahci-remap.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel AHCI remapped NVMe platform driver
+ *
+ * Copyright (c) 2011-2016, Intel Corporation.
+ * Copyright (c) 2019, Endless Mobile, Inc.
+ *
+ * Support platform devices created by the ahci driver, corresponding to
+ * NVMe devices that have been remapped into the ahci device memory space.
+ *
+ * This scheme is rather peculiar, as NVMe is inherently based on PCIe,
+ * however we only have access to the NVMe device MMIO space and an
+ * interrupt. Without access to the pci_device, many features are
+ * unavailable; this driver only intends to offer basic functionality.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include "pci.h"
+
+struct ahci_remap_data {
+   atomic_t enabled;
+};
+
+static struct ahci_remap_data *to_ahci_remap_data(struct nvme_dev *dev)
+{
+   return dev->dev->platform_data;
+}
+
+static int ahci_remap_enable(struct nvme_dev *dev)
+{
+   int rc;
+   struct resource *res;
+   struct device *ddev = dev->dev;
+   struct device *parent = ddev->parent;
+   struct ahci_remap_data *adata = to_ahci_remap_data(dev);
+   struct platform_device *pdev = to_platform_device(ddev);
+
+   res = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
+   if (!res)
+   return -ENXIO;
+
+   /* parent ahci device determines the dma mask */
+   if (dma_supported(parent, DMA_BIT_MASK(64)))
+   rc = dma_coerce_mask_and_coherent(ddev, DMA_BIT_MASK(64));
+   else if (dma_supported(parent, DMA_BIT_MASK(32)))
+   rc = dma_coerce_mask_and_coherent(ddev, DMA_BIT_MASK(32));
+   else
+   rc = -ENXIO;
+   if (rc)
+   return rc;
+
+   rc = nvme_enable(dev);
+   if (rc)
+   return rc;
+
+   atomic_inc(&adata->enabled);
+
+   return 0;
+}
+
+static int ahci_remap_is_enabled(struct nvme_dev *dev)
+{
+   struct ahci_remap_data *adata = to_ahci_remap_data(dev);
+
+   return atomic_read(&adata->enabled) > 0;
+}
+
+static

[PATCH v2 2/5] nvme: rename "pci" operations to "mmio"

From: Dan Williams 

In preparation for adding a platform_device nvme host, rename to a more
generic "mmio" prefix.

Signed-off-by: Dan Williams 
Signed-off-by: Daniel Drake 
---
 drivers/nvme/host/pci.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 524d6bd6d095..42990b93349d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1108,7 +1108,7 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
return found;
 }
 
-static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
+static void nvme_mmio_submit_async_event(struct nvme_ctrl *ctrl)
 {
struct nvme_dev *dev = to_nvme_dev(ctrl);
struct nvme_queue *nvmeq = &dev->queues[0];
@@ -2448,7 +2448,7 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
dma_pool_destroy(dev->prp_small_pool);
 }
 
-static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
+static void nvme_mmio_free_ctrl(struct nvme_ctrl *ctrl)
 {
struct nvme_dev *dev = to_nvme_dev(ctrl);
 
@@ -2610,42 +2610,42 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
nvme_put_ctrl(&dev->ctrl);
 }
 
-static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
+static int nvme_mmio_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
 {
*val = readl(to_nvme_dev(ctrl)->bar + off);
return 0;
 }
 
-static int nvme_pci_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
+static int nvme_mmio_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
 {
writel(val, to_nvme_dev(ctrl)->bar + off);
return 0;
 }
 
-static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
+static int nvme_mmio_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
 {
*val = readq(to_nvme_dev(ctrl)->bar + off);
return 0;
 }
 
-static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
+static int nvme_mmio_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 {
struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
 
return snprintf(buf, size, "%s", dev_name(&pdev->dev));
 }
 
-static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
+static const struct nvme_ctrl_ops nvme_mmio_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
.flags  = NVME_F_METADATA_SUPPORTED |
  NVME_F_PCI_P2PDMA,
-   .reg_read32 = nvme_pci_reg_read32,
-   .reg_write32= nvme_pci_reg_write32,
-   .reg_read64 = nvme_pci_reg_read64,
-   .free_ctrl  = nvme_pci_free_ctrl,
-   .submit_async_event = nvme_pci_submit_async_event,
-   .get_address= nvme_pci_get_address,
+   .reg_read32 = nvme_mmio_reg_read32,
+   .reg_write32= nvme_mmio_reg_write32,
+   .reg_read64 = nvme_mmio_reg_read64,
+   .free_ctrl  = nvme_mmio_free_ctrl,
+   .submit_async_event = nvme_mmio_submit_async_event,
+   .get_address= nvme_mmio_get_address,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
@@ -2758,7 +2758,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
goto release_pools;
}
 
-   result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops,
+   result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_mmio_ctrl_ops,
quirks);
if (result)
goto release_mempool;
-- 
2.20.1

[PATCH v2 1/5] ahci: Discover Intel remapped NVMe devices

Intel SATA AHCI controllers support a strange mode where NVMe devices
disappear from the PCI bus, and instead are remapped into AHCI PCI memory
space.

Many current and upcoming consumer products ship with the AHCI controller
in this "RAID" or "Intel RST Premium with Intel Optane System Acceleration"
mode by default. Without Linux support for this remapped mode,
the default out-of-the-box experience is that the NVMe storage device
is inaccessible (which in many cases is the only internal storage device).

Using partial information provided by Intel in datasheets, emails,
and previous patches, extend the AHCI driver to detect the remapped NVMe
devices and create corresponding platform devices, to be picked up
by the nvme driver.
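
The detection step described above can be sketched in user space as follows.
This is only an illustration under stated assumptions: the toy_* names, the
device-type enum and the slot array are hypothetical stand-ins for the real
register reads, not the driver's actual definitions.

```c
#include <stdint.h>

/* Hypothetical device types reported for each remap slot. */
enum toy_dev_type { TOY_DT_NONE, TOY_DT_NVME, TOY_DT_OTHER };

/*
 * Count the remapped NVMe devices: walk every slot whose bit is set in
 * the "supported" mask and keep only the slots that report NVMe.
 * max_slots is assumed to be at most 32.
 */
static unsigned int toy_count_remapped_nvme(uint32_t supported_mask,
					    const enum toy_dev_type *slot_type,
					    unsigned int max_slots)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < max_slots; i++) {
		if (!(supported_mask & (UINT32_C(1) << i)))
			continue;	/* slot not advertised by the controller */
		if (slot_type[i] != TOY_DT_NVME)
			continue;	/* remapped device is not NVMe */
		count++;		/* the driver would create a platform device here */
	}
	return count;
}
```

In the real driver each counted slot would become a platform device handed to
the nvme driver; the sketch only shows the filtering logic.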

Our knowledge of the design and workings of this remapping scheme
has been collected in ahci-remap.h, which can be considered the best
spec we have at the moment.

Based on earlier work by Dan Williams.

Signed-off-by: Daniel Drake 
---
 drivers/ata/Kconfig|  32 +++
 drivers/ata/ahci.c | 173 -
 drivers/ata/ahci.h |  14 +++
 include/linux/ahci-remap.h | 140 +++---
 4 files changed, 329 insertions(+), 30 deletions(-)

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index a6beb2c5a692..6e82d66d7516 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -109,6 +109,38 @@ config SATA_MOBILE_LPM_POLICY
  Note "Minimum power" is known to cause issues, including disk
  corruption, with some disks and should not be used.
 
+config SATA_AHCI_INTEL_NVME_REMAP
+   bool "AHCI: Intel Remapped NVMe device support"
+   depends on SATA_AHCI
+   depends on BLK_DEV_NVME
+   help
+ Support access to remapped NVMe devices that appear in AHCI PCI
+ memory space.
+
+ You'll need this in order to access your NVMe storage if you are
+ running an Intel AHCI controller in "RAID" or "Intel RST Premium
+ with Intel Optane System Acceleration" mode. This is the default
+ configuration of many consumer products. If you have storage devices
+ being affected by this, you'll have noticed that such devices are
+ absent, and you'll see a warning in your kernel logs about remapped
+ NVMe devices.
+
+ Instead of enabling this option, it is recommended to go into the
+ BIOS menu and change the SATA device into "AHCI" mode in order to
+ gain access to the affected devices, while also enjoying all
+ available NVMe features and performance.
+
+ However, if you do want to access the NVMe devices in remapped
+ mode, say Y. Negative consequences of remapped device access
+ include:
+ - No NVMe device power management
+ - No NVMe reset support
+ - No NVMe quirks based on PCI ID
+ - No SR-IOV VFs
+ - Reduced performance through a shared, legacy interrupt
+
+ If unsure, say N.
+
 config SATA_AHCI_PLATFORM
tristate "Platform AHCI SATA support"
help
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index f7652baa6337..b58316347539 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1499,11 +1500,11 @@ static irqreturn_t ahci_thunderx_irq_handler(int irq, 
void *dev_instance)
 }
 #endif
 
-static void ahci_remap_check(struct pci_dev *pdev, int bar,
+static int ahci_remap_init(struct pci_dev *pdev, int bar,
struct ahci_host_priv *hpriv)
 {
int i, count = 0;
-   u32 cap;
+   u32 supported_devs;
 
/*
 * Check if this device might have remapped nvme devices.
@@ -1511,33 +1512,68 @@ static void ahci_remap_check(struct pci_dev *pdev, int 
bar,
if (pdev->vendor != PCI_VENDOR_ID_INTEL ||
pci_resource_len(pdev, bar) < SZ_512K ||
bar != AHCI_PCI_BAR_STANDARD ||
-   !(readl(hpriv->mmio + AHCI_VSCAP) & 1))
-   return;
+   !(readl(hpriv->mmio + AHCI_VS_CAP) & AHCI_VS_CAP_NRMBE))
+   return -ENODEV;
 
-   cap = readq(hpriv->mmio + AHCI_REMAP_CAP);
-   for (i = 0; i < AHCI_MAX_REMAP; i++) {
-   if ((cap & (1 << i)) == 0)
+   supported_devs = readl(hpriv->mmio + AHCI_REMAP_RCR_L)
+& AHCI_REMAP_RCR_L_NRS_MASK;
+   for_each_set_bit(i, (unsigned long *)&supported_devs, AHCI_MAX_REMAP) {
+   struct ahci_remap *rdev;
+   u32 dcc;
+
+   /* Check that the remapped device is NVMe */
+   dcc = readl(hpriv->mmio + ahci_remap_dcc(i));
+   if ((dcc & AHCI_REMAP_DCC_DT) != AHCI_REMAP_DCC_DT_NVME)
continue;
-   if (readl(hpriv->mmio + ahci_remap_dcc(i))
-   != PCI_CLASS_STORAGE_EXPRESS)
+
+   dcc &= AHCI_REMAP_DCC_CC_MASK;
+   if

[PATCH v2 3/5] nvme: introduce nvme_dev_ops

In preparation for a platform device nvme driver, move the bus specific
portions of nvme to nvme_dev_ops, or otherwise rewrite routines to use a
generic 'struct device' instead of 'struct pci_dev'.
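
As a rough illustration of the ops-table pattern this patch introduces, here
is a minimal user-space sketch. All toy_* names are hypothetical; the point is
only that required callbacks are always set, while optional ones may be NULL
and fall back to generic behaviour.

```c
#include <stddef.h>

struct toy_dev;

/* Bus-specific operations; map_queues is optional and may be NULL. */
struct toy_dev_ops {
	int (*enable)(struct toy_dev *dev);			/* required */
	int (*map_queues)(struct toy_dev *dev, int offset);	/* optional */
};

enum { TOY_MAPPED_NONE, TOY_MAPPED_GENERIC, TOY_MAPPED_BUS };

struct toy_dev {
	const struct toy_dev_ops *ops;
	int mapped_by;	/* records which mapping path ran */
};

static int toy_generic_map_queues(struct toy_dev *dev)
{
	dev->mapped_by = TOY_MAPPED_GENERIC;
	return 0;
}

/* Core code: use the bus-specific hook when present, else the generic one. */
static int toy_map_queues(struct toy_dev *dev, int offset)
{
	if (offset && dev->ops->map_queues)
		return dev->ops->map_queues(dev, offset);
	return toy_generic_map_queues(dev);
}

static int toy_pci_enable(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}

static int toy_pci_map_queues(struct toy_dev *dev, int offset)
{
	(void)offset;
	dev->mapped_by = TOY_MAPPED_BUS;
	return 0;
}

static int toy_platform_enable(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}

static const struct toy_dev_ops toy_pci_ops = {
	.enable		= toy_pci_enable,
	.map_queues	= toy_pci_map_queues,
};

/* A platform-style backend that leaves the optional hook unset. */
static const struct toy_dev_ops toy_platform_ops = {
	.enable		= toy_platform_enable,
};
```

A backend that cannot offer a bus-specific queue mapping simply omits the
hook, and the core code takes the generic path without a special case.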

Based on earlier work by Dan Williams.

Signed-off-by: Daniel Drake 
---
 drivers/nvme/host/pci.c | 410 +++-
 1 file changed, 275 insertions(+), 135 deletions(-)

 I took Dan Williams's earlier patch here and refreshed it for the
latest nvme driver, which has gained a few more places where it uses
the PCI device, so nvme_dev_ops grew a bit more.

Is this a suitable way of handling this case? It feels a little
unclean to have both the NVMe host layer and the PCI-specific dev ops
in the same file. Maybe it makes sense because NVMe is inherently a PCI
thing under normal circumstances? Or would it be cleaner for me to
rename "pci.c" to "mmio.c" and then separate the pci dev ops into
a new "pci.c"?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 42990b93349d..23bda524f16b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -89,10 +89,51 @@ struct nvme_queue;
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+struct nvme_dev_ops {
+   /* Enable device (required) */
+   int (*enable)(struct nvme_dev *dev);
+
+   /* Disable device (required) */
+   void (*disable)(struct nvme_dev *dev);
+
+   /* Allocate IRQ vectors for given number of io queues (required) */
+   int (*setup_irqs)(struct nvme_dev *dev, int nr_io_queues);
+
+   /* Get the IRQ vector for a specific queue */
+   int (*q_irq)(struct nvme_queue *q);
+
+   /* Allocate device-specific SQ command buffer (optional) */
+   int (*cmb_alloc_sq_cmds)(struct nvme_queue *nvmeq, size_t size,
+struct nvme_command **sq_cmds,
+dma_addr_t *sq_dma_addr);
+
+   /* Free device-specific SQ command buffer (optional) */
+   void (*cmb_free_sq_cmds)(struct nvme_queue *nvmeq,
+struct nvme_command *sq_cmds, size_t size);
+
+   /* Device-specific mapping of blk queues to CPUs (optional) */
+   int (*map_queues)(struct nvme_dev *dev, struct blk_mq_queue_map *map,
+ int offset);
+
+   /* Check if device is enabled on the bus (required) */
+   int (*is_enabled)(struct nvme_dev *dev);
+
+   /* Check if channel is in running state (required) */
+   int (*is_offline)(struct nvme_dev *dev);
+
+   /* Check if device is present and responding (optional) */
+   bool (*is_present)(struct nvme_dev *dev);
+
+   /* Check & log device state before it gets reset (optional) */
+   void (*warn_reset)(struct nvme_dev *dev);
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
 struct nvme_dev {
+   const struct resource *res;
+   const struct nvme_dev_ops *ops;
struct nvme_queue *queues;
struct blk_mq_tag_set tagset;
struct blk_mq_tag_set admin_tagset;
@@ -178,6 +219,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl 
*ctrl)
  */
 struct nvme_queue {
struct nvme_dev *dev;
+   char irqname[24];   /* nvme4294967295-65535\0 */
spinlock_t sq_lock;
struct nvme_command *sq_cmds;
 /* only used for poll queues: */
@@ -384,6 +426,11 @@ static unsigned int nvme_pci_iod_alloc_size(struct 
nvme_dev *dev,
return alloc_size + sizeof(struct scatterlist) * nseg;
 }
 
+static int nvme_pci_q_irq(struct nvme_queue *nvmeq)
+{
+   return pci_irq_vector(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector);
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
unsigned int hctx_idx)
 {
@@ -444,7 +491,14 @@ static int queue_irq_offset(struct nvme_dev *dev)
return 0;
 }
 
-static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
+static int nvme_pci_map_queues(struct nvme_dev *dev,
+  struct blk_mq_queue_map *map,
+  int offset)
+{
+   return blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+}
+
+static int nvme_map_queues(struct blk_mq_tag_set *set)
 {
struct nvme_dev *dev = set->driver_data;
int i, qoff, offset;
@@ -464,8 +518,8 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 * affinity), so use the regular blk-mq cpu mapping
 */
map->queue_offset = qoff;
-   if (i != HCTX_TYPE_POLL && offset)
-   blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), 
offset);
+   if (i != HCTX_TYPE_POLL && offset && dev->ops->map_queues)
+   dev->ops->map_queues(dev, map, offset);
else
blk_mq_map_queues(map);
qoff +=

[PATCH v2 0/5] Support Intel AHCI remapped NVMe devices

Intel SATA AHCI controllers support a strange mode where NVMe devices
disappear from the PCI bus, and instead are remapped into AHCI PCI memory
space.

Many current and upcoming consumer products ship with the AHCI controller
in this "RAID" or "Intel RST Premium with Intel Optane System Acceleration"
mode by default. Without Linux support for this remapped mode,
the default out-of-the-box experience is that the NVMe storage device
is inaccessible (which in many cases is the only internal storage device).

In most cases, the SATA configuration can be changed in the BIOS menu to
"AHCI", resulting in the AHCI & NVMe devices appearing as separate
devices as you would ordinarily expect. Changing this configuration
is the recommendation for power users because there are several limitations
of the remapped mode (now documented in Kconfig help text).

However, it's also important to support the remapped mode given that
it is an increasingly common product default. We cannot expect ordinary
users of consumer PCs to find out about this situation and then
confidently go into the BIOS menu to change options.

This patch set implements support for the remapped mode.

v1 of these patches was originally posted by Dan Williams in 2016.
https://marc.info/?l=linux-ide&m=147709610621480&w=2
Since then:

- Intel stopped developing these patches & hasn't been responding to
my emails on this topic.

- More register documentation appeared in

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/300-series-chipset-pch-datasheet-vol-2.pdf

- I tried Christoph's suggestion of exposing the devices on a fake PCI bus,
hence not requiring NVMe driver changes, but Bjorn Helgaas does not think
it's the right approach and instead recommends the approach taken here.
https://marc.info/?l=linux-pci&m=156034736822205&w=2

- More consumer devices have appeared with this setting as the default,
and with the decreasing cost of NVMe storage, it appears that a whole
bunch more consumer PC products currently in development are going to
ship in RAID/remapped mode, with only a single NVMe disk, which Linux
will otherwise be unable to access by default.

- We heard from hardware vendors that this Linux incompatibility is
causing them to consider discontinuing Linux support on affected
products. Changing the BIOS setting is too much of a logistics
challenge.

- I updated Dan's patches for current kernels. I added docs and references
and incorporated the new register info. I incorporated feedback to push
the recommendation that the user goes back to AHCI mode via the BIOS
setting (in kernel logs and Kconfig help). And made some misc minor
changes that I think are sensible.

- I investigated MSI-X support. Can't quite get it working, but I'm hopeful
that we can figure it out and add it later. With these patches shared
I'll follow up with more details about that. With the focus on
compatibility with default configuration of common consumer products,
I'm hoping we could land an initial version without MSI support before
tending to those complications.

Dan Williams (2):
nvme: rename "pci" operations to "mmio"
nvme: move common definitions to pci.h

Daniel Drake (3):
ahci: Discover Intel remapped NVMe devices
nvme: introduce nvme_dev_ops
nvme: Intel AHCI remap support

drivers/ata/Kconfig | 33 ++
drivers/ata/ahci.c | 173 --
drivers/ata/ahci.h | 14 +
drivers/nvme/host/Kconfig| 3 +
drivers/nvme/host/Makefile | 3 +
drivers/nvme/host/intel-ahci-remap.c | 185 ++
drivers/nvme/host/pci.c | 490 ++-
drivers/nvme/host/pci.h | 145
include/linux/ahci-remap.h | 140 +++-
9 files changed, 922 insertions(+), 264 deletions(-)
create mode 100644 drivers/nvme/host/intel-ahci-remap.c
create mode 100644 drivers/nvme/host/pci.h

--
2.20.1

RE: [PATCH net-next v6 2/5] net: stmmac: introducing support for DWC xPCS logics

2019-06-19 Thread Ong, Boon Leong

>>From: Jose Abreu [mailto:jose.ab...@synopsys.com]
>>From: Florian Fainelli 
>>
>>> +Russell,
>>>
>>> On 6/4/2019 11:58 AM, Voon Weifeng wrote:
>>> > From: Ong Boon Leong 
>>> >
>>> > xPCS is DWC Ethernet Physical Coding Sublayer that may be integrated
>>> > into a GbE controller that uses DWC EQoS MAC controller. An example of
>>> > HW configuration is shown below:-
>>> >
>>> >   <--------------GBE Controller------------->|<--External PHY chip-->
>>> >
>>> >   +----------+         +----+    +---+               +--------------+
>>> >   |   EQoS   | <-GMII->| DW |<-->|PHY| <-- SGMII --> | External GbE |
>>> >   |   MAC    |         |xPCS|    |IF |               | PHY Chip     |
>>> >   +----------+         +----+    +---+               +--------------+
>>> >        ^                 ^                                  ^
>>> >        |                 |                                  |
>>> >        +------------------------MDIO------------------------+
>>> >
>>> > xPCS is a Clause-45 MDIO Manageable Device (MMD) and we need a way
>>to
>>> > differentiate it from external PHY chip that is discovered over MDIO.
>>> > Therefore, xpcs_phy_addr is introduced in stmmac platform data
>>> > (plat_stmmacenet_data) for differentiating xPCS from 'phy_addr' that
>>> > belongs to external PHY.
>>>
>>> Assuming this DW xPCS can be found with designs other than STMMAC
>>would
>>> not it make sense to model this as some kind of PHY/MDIO bridge? A
>>> little bit like what drivers/net/phy/xilinx_gmii2rgmii.c tries to do?
>>
>>Yes, DW XPCS is a separate IP that can be sold without the MAC.
>
>Hi Florian, thanks for pointing out the PHY driver for GMII to RGMII converter
>implementation. It seems like community would like dwxpcs to take the
>converter phy driver direction.
>
>We would like to check with community what is the MAC controller that is
>using above PHY driver so that we can dig deeper into the PHY & MAC driver
>architecture. We would like to map the existing usage of dwxpcs.c in 3/5 of
>this series is architecturally ready for PHY driver framework or new APIs
>would need to be defined.

Just to cycle-back to this track, we are working towards getting the ACPI device
ID for this IP. Meanwhile, since the C45 MDIO patch is also needed by
Biao, we plan to line up the below patch for merge.

[PATCH net-next v6 1/5] net: stmmac: enable clause 45 mdio support

Is there any concern with this approach?

[PATCH RESEND 8/8] mm: Remove mmap_legacy_base and mmap_compat_legacy_base fields from mm_struct

Now that x86 and parisc do not use those fields anymore, remove them from
mm code.

Signed-off-by: Alexandre Ghiti 
---
 include/linux/mm_types.h | 2 --
 mm/debug.c   | 4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1d1093474c1a..9a5935f9cc7e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -364,11 +364,9 @@ struct mm_struct {
unsigned long pgoff, unsigned long flags);
 #endif
unsigned long mmap_base;/* base of mmap area */
-   unsigned long mmap_legacy_base; /* base of mmap area in 
bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
/* Base adresses for compatible mmap() */
unsigned long mmap_compat_base;
-   unsigned long mmap_compat_legacy_base;
 #endif
unsigned long task_size;/* size of task vm space */
unsigned long highest_vm_end;   /* highest vma end address */
diff --git a/mm/debug.c b/mm/debug.c
index 8345bb6e4769..3ddffe1efcda 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -134,7 +134,7 @@ void dump_mm(const struct mm_struct *mm)
 #ifdef CONFIG_MMU
"get_unmapped_area %px\n"
 #endif
-   "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
+   "mmap_base %lu highest_vm_end %lu\n"
"pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count 
%d\n"
"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
@@ -162,7 +162,7 @@ void dump_mm(const struct mm_struct *mm)
 #ifdef CONFIG_MMU
mm->get_unmapped_area,
 #endif
-   mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
+   mm->mmap_base, mm->highest_vm_end,
mm->pgd, atomic_read(&mm->mm_users),
atomic_read(&mm->mm_count),
mm_pgtables_bytes(mm),
-- 
2.20.1

[PATCH RESEND 7/8] x86: Use mmap_base, not mmap_legacy_base, as low_limit for bottom-up mmap

Bottom-up mmap scheme is used twice:

- for legacy mode, in which mmap_legacy_base and mmap_compat_legacy_base
are respectively equal to mmap_base and mmap_compat_base.

- in case of mmap failure in top-down mode, where there is no need to go
through the whole address space again for the bottom-up fallback: the goal
of this fallback is to find, as a last resort, space between the top-down
mmap base and the stack, which is the only place not covered by the
top-down mmap.

This commit therefore removes the use of the mmap_legacy_base and
mmap_compat_legacy_base fields from x86 code.
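
A minimal sketch of why one field is enough, assuming a toy mm structure (the
toy_* names are hypothetical, not the kernel's): the base is chosen once
according to the layout, and the bottom-up low limit reads that same field in
both legacy and top-down mode.

```c
/* Toy stand-in for the single base field kept in mm_struct. */
struct toy_mm {
	unsigned long mmap_base;
};

/* Pick the base once: the legacy layout and the top-down layout share it. */
static void toy_pick_mmap_base(struct toy_mm *mm, int is_legacy,
			       unsigned long legacy_base,
			       unsigned long topdown_base)
{
	mm->mmap_base = is_legacy ? legacy_base : topdown_base;
}

/*
 * Low limit for a bottom-up search. In legacy mode this is the legacy
 * base; in top-down mode it is where the failed top-down search began.
 */
static unsigned long toy_bottomup_low_limit(const struct toy_mm *mm)
{
	return mm->mmap_base;
}
```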

Signed-off-by: Alexandre Ghiti 
---
 arch/x86/include/asm/elf.h   |  2 +-
 arch/x86/kernel/sys_x86_64.c |  4 ++--
 arch/x86/mm/hugetlbpage.c|  4 ++--
 arch/x86/mm/mmap.c   | 20 +---
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 69c0f892e310..bbfd81453250 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -307,7 +307,7 @@ static inline int mmap_is_ia32(void)
 
 extern unsigned long task_size_32bit(void);
 extern unsigned long task_size_64bit(int full_addr_space);
-extern unsigned long get_mmap_base(int is_legacy);
+extern unsigned long get_mmap_base(void);
 extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index f7476ce23b6e..0bf8604bea5e 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -121,7 +121,7 @@ static void find_start_end(unsigned long addr, unsigned 
long flags,
return;
}
 
-   *begin  = get_mmap_base(1);
+   *begin  = get_mmap_base();
if (in_32bit_syscall())
*end = task_size_32bit();
else
@@ -211,7 +211,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = PAGE_SIZE;
-   info.high_limit = get_mmap_base(0);
+   info.high_limit = get_mmap_base();
 
/*
 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 4b90339aef50..3a7f11e66114 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -86,7 +86,7 @@ static unsigned long 
hugetlb_get_unmapped_area_bottomup(struct file *file,
 
info.flags = 0;
info.length = len;
-   info.low_limit = get_mmap_base(1);
+   info.low_limit = get_mmap_base();
 
/*
 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
@@ -106,7 +106,7 @@ static unsigned long 
hugetlb_get_unmapped_area_topdown(struct file *file,
 {
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
-   unsigned long mmap_base = get_mmap_base(0);
+   unsigned long mmap_base = get_mmap_base();
 
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index aae9a933dfd4..54c9ff301323 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -113,13 +113,12 @@ static unsigned long mmap_legacy_base(unsigned long rnd,
  * This function, called very early during the creation of a new
  * process VM image, sets up which VM layout function to use:
  */
-static void arch_pick_mmap_base(unsigned long *base, unsigned long 
*legacy_base,
+static void arch_pick_mmap_base(unsigned long *base,
unsigned long random_factor, unsigned long task_size,
struct rlimit *rlim_stack)
 {
-   *legacy_base = mmap_legacy_base(random_factor, task_size);
if (mmap_is_legacy())
-   *base = *legacy_base;
+   *base = mmap_legacy_base(random_factor, task_size);
else
*base = mmap_base(random_factor, task_size, rlim_stack);
 }
@@ -131,7 +130,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
else
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 
-   arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
+   arch_pick_mmap_base(&mm->mmap_base,
arch_rnd(mmap64_rnd_bits), task_size_64bit(0),
rlim_stack);
 
@@ -142,23 +141,22 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
 * applications and 32bit applications. The 64bit syscall uses
 * mmap_base, the compat syscall uses mmap_compat_base.
 */
-   arch_pick_mmap_base(&mm->mmap_compat_base, &mm->mmap_compat_legacy_base,
+   arch_pick_mmap_base(&mm->mmap_compat_base,
arch_rnd(mmap32_rnd_bits), task_size_32bit(),
rlim_stack);
 #endif
 }
 
-unsigned long get_mmap_base(int is_legacy)
+unsigned long get_mmap_base(void)
 {
struct mm_struct *mm =

[PATCH RESEND 6/8] parisc: Use mmap_base, not mmap_legacy_base, as low_limit for bottom-up mmap

Bottom-up mmap scheme is used twice:

- for legacy mode, in which mmap_legacy_base and mmap_base are equal.

- in case of mmap failure in top-down mode, where there is no need to go
through the whole address space again for the bottom-up fallback: the goal
of this fallback is to find, as a last resort, space between the top-down
mmap base and the stack, which is the only place not covered by the
top-down mmap.

This commit therefore removes the use of the mmap_legacy_base field from
parisc code.

Signed-off-by: Alexandre Ghiti 
---
 arch/parisc/kernel/sys_parisc.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index 5d458a44b09c..e987f3a8eb0b 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -119,7 +119,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, 
unsigned long addr,
 
info.flags = 0;
info.length = len;
-   info.low_limit = mm->mmap_legacy_base;
+   info.low_limit = mm->mmap_base;
info.high_limit = mmap_upper_limit(NULL);
info.align_mask = last_mmap ? (PAGE_MASK & (SHM_COLOUR - 1)) : 0;
info.align_offset = shared_align_offset(last_mmap, pgoff);
@@ -240,13 +240,11 @@ static unsigned long mmap_legacy_base(void)
  */
 void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 {
-   mm->mmap_legacy_base = mmap_legacy_base();
-   mm->mmap_base = mmap_upper_limit(rlim_stack);
-
if (mmap_is_legacy()) {
-   mm->mmap_base = mm->mmap_legacy_base;
+   mm->mmap_base = mmap_legacy_base();
mm->get_unmapped_area = arch_get_unmapped_area;
} else {
+   mm->mmap_base = mmap_upper_limit(rlim_stack);
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
}
 }
-- 
2.20.1

[PATCH RESEND 5/8] mm: Start fallback top-down mmap at mm->mmap_base

In case of mmap failure in top-down mode, there is no need to go through
the whole address space again for the bottom-up fallback: the goal of this
fallback is to find, as a last resort, space between the top-down mmap base
and the stack, which is the only place not covered by the top-down mmap.
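
The reasoning above can be illustrated with a toy user-space model. This is a
sketch under simplifying assumptions, not the kernel's vm_unmapped_area(): the
address space is an array of pages, and every name below is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_GAP ((size_t)-1)

/* Toy address space: used[i] is true when page i is already mapped. */
static bool range_is_free(const bool *used, size_t start, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (used[start + i])
			return false;
	return true;
}

/* Top-down: highest gap of len pages lying entirely inside [low, high). */
static size_t find_gap_topdown(const bool *used, size_t low, size_t high,
			       size_t len)
{
	if (len == 0 || high < low || high - low < len)
		return NO_GAP;
	for (size_t start = high - len; ; start--) {
		if (range_is_free(used, start, len))
			return start;
		if (start == low)
			break;
	}
	return NO_GAP;
}

/* Bottom-up: lowest gap of len pages lying entirely inside [low, high). */
static size_t find_gap_bottomup(const bool *used, size_t low, size_t high,
				size_t len)
{
	if (len == 0 || high < low || high - low < len)
		return NO_GAP;
	for (size_t start = low; start + len <= high; start++)
		if (range_is_free(used, start, len))
			return start;
	return NO_GAP;
}

/*
 * Try top-down below mmap_base first. On failure everything below
 * mmap_base has already been scanned, so the bottom-up fallback only
 * needs to look between mmap_base and the stack.
 */
static size_t toy_get_unmapped_area(const bool *used, size_t first_page,
				    size_t mmap_base, size_t stack_bottom,
				    size_t len)
{
	size_t addr = find_gap_topdown(used, first_page, mmap_base, len);

	if (addr != NO_GAP)
		return addr;
	return find_gap_bottomup(used, mmap_base, stack_bottom, len);
}
```

Starting the fallback at a lower address would only rescan pages the top-down
pass has already rejected.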

Signed-off-by: Alexandre Ghiti 
---
 mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index dedae10cb6e2..e563145c1ff4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2185,7 +2185,7 @@ arch_get_unmapped_area_topdown(struct file *filp, 
unsigned long addr,
if (offset_in_page(addr)) {
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
-   info.low_limit = TASK_UNMAPPED_BASE;
+   info.low_limit = arch_get_mmap_base(addr, mm->mmap_base);
info.high_limit = mmap_end;
addr = vm_unmapped_area(&info);
}
-- 
2.20.1

Re: [PATCH 1/1] staging: media: fix style problem

2019-06-19 Thread Nathan Chancellor

On Thu, Jun 20, 2019 at 10:32:48AM +0530, Aliasgar Surti wrote:
> From: Aliasgar Surti 
> 
> checkpatch reported "WARNING: line over 80 characters".
> This patch fixes the warning for file davinci_vpfe/dm365_isif.c
> 
> Signed-off-by: Aliasgar Surti 
> ---
>  drivers/staging/media/davinci_vpfe/dm365_isif.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/media/davinci_vpfe/dm365_isif.c 
> b/drivers/staging/media/davinci_vpfe/dm365_isif.c
> index 46fd818..12bdf91 100644
> --- a/drivers/staging/media/davinci_vpfe/dm365_isif.c
> +++ b/drivers/staging/media/davinci_vpfe/dm365_isif.c
> @@ -532,7 +532,8 @@ static int isif_validate_dfc_params(const struct 
> vpfe_isif_dfc *dfc)
>  #define DM365_ISIF_MAX_CLVSV 0x1fff
>  #define DM365_ISIF_MAX_HEIGHT_BLACK_REGION   0x1fff
>  
> -static int isif_validate_bclamp_params(const struct vpfe_isif_black_clamp 
> *bclamp)
> +static int isif_validate_bclamp_params(const struct vpfe_isif_black_clamp
> +*bclamp)

I think

static int
isif_validate_bclamp_params(const struct vpfe_isif_black_clamp *bclamp)

is a better choice for this change.

Cheers,
Nathan

>  {
>   int err = -EINVAL;
>  
> @@ -593,7 +594,8 @@ isif_validate_raw_params(const struct 
> vpfe_isif_raw_config *params)
>   return isif_validate_bclamp_params(&params->bclamp);
>  }
>  
> -static int isif_set_params(struct v4l2_subdev *sd, const struct 
> vpfe_isif_raw_config *params)
> +static int isif_set_params(struct v4l2_subdev *sd,
> +const struct vpfe_isif_raw_config *params)
>  {
>   struct vpfe_isif_device *isif = v4l2_get_subdevdata(sd);
>   int ret = -EINVAL;
> -- 
> 2.7.4
>

[PATCH RESEND 3/8] sparc: Start fallback of top-down mmap at mm->mmap_base

In case of mmap failure in top-down mode, there is no need to go through
the whole address space again for the bottom-up fallback: the goal of this
fallback is to find, as a last resort, space between the top-down mmap base
and the stack, which is the only place not covered by the top-down mmap.

Signed-off-by: Alexandre Ghiti 
---
 arch/sparc/kernel/sys_sparc_64.c | 2 +-
 arch/sparc/mm/hugetlbpage.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index ccc88926bc00..ea1de1e5fa8d 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_64.c
@@ -206,7 +206,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
if (addr & ~PAGE_MASK) {
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
-   info.low_limit = TASK_UNMAPPED_BASE;
+   info.low_limit = mm->mmap_base;
info.high_limit = STACK_TOP32;
addr = vm_unmapped_area(&info);
}
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f78793a06bbd..9c67f805abc8 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -86,7 +86,7 @@ hugetlb_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
if (addr & ~PAGE_MASK) {
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
-   info.low_limit = TASK_UNMAPPED_BASE;
+   info.low_limit = mm->mmap_base;
info.high_limit = STACK_TOP32;
addr = vm_unmapped_area(&info);
}
-- 
2.20.1

[PATCH] KVM: vmx: Fix the broken usage of vmx_xsaves_supported

2019-06-19 Thread Tao Xu

The helper vmx_xsaves_supported() returns the SECONDARY_EXEC_XSAVES bit of
vmcs_config.cpu_based_2nd_exec_ctrl. Once setup_vmcs_config() has run, that
bit stays set for as long as the VMCS supports its 1-setting, whatever the
guest is configured with. The MSR_IA32_XSS get/set paths should therefore
check the guest's CPUID rather than this fixed value.

Besides, vmx_compute_secondary_exec_control() adjusts the
SECONDARY_EXEC_XSAVES bit based on the guest CPUID bits X86_FEATURE_XSAVE
and X86_FEATURE_XSAVES, so the updated value should be used to decide
whether to set XSS_EXIT_BITMAP.
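
A minimal user-space sketch of the per-guest check, assuming hypothetical
toy_* names and flag values (these are not KVM's definitions): access to the
XSS value is allowed only when the guest's own feature set advertises both
XSAVE and XSAVES, independent of what the host hardware can do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature flags standing in for guest CPUID bits. */
#define TOY_FEATURE_XSAVE	(UINT32_C(1) << 0)
#define TOY_FEATURE_XSAVES	(UINT32_C(1) << 1)

struct toy_vcpu {
	uint32_t guest_features;	/* what the guest's CPUID advertises */
	uint64_t ia32_xss;		/* emulated MSR value */
};

static bool toy_guest_has(const struct toy_vcpu *vcpu, uint32_t feature)
{
	return (vcpu->guest_features & feature) == feature;
}

/*
 * Read the emulated XSS value. Returns 0 on success and 1 when the
 * access must be refused, mirroring the "return 1" in the diff below.
 */
static int toy_get_xss(const struct toy_vcpu *vcpu, uint64_t *data)
{
	if (!toy_guest_has(vcpu, TOY_FEATURE_XSAVE) ||
	    !toy_guest_has(vcpu, TOY_FEATURE_XSAVES))
		return 1;
	*data = vcpu->ia32_xss;
	return 0;
}
```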

Co-developed-by: Xiaoyao Li 
Signed-off-by: Xiaoyao Li 
Signed-off-by: Tao Xu 
---
 arch/x86/kvm/vmx/vmx.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b93e36ddee5e..935cf72439a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1721,7 +1721,8 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return vmx_get_vmx_msr(&vmx->nested.msrs, msr_info->index,
   &msr_info->data);
case MSR_IA32_XSS:
-   if (!vmx_xsaves_supported())
+   if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) ||
+   !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
return 1;
msr_info->data = vcpu->arch.ia32_xss;
break;
@@ -1935,7 +1936,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 1;
return vmx_set_vmx_msr(vcpu, msr_index, data);
case MSR_IA32_XSS:
-   if (!vmx_xsaves_supported())
+   if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) ||
+   !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
return 1;
/*
 * The only supported bit as of Skylake is bit 8, but
@@ -4094,7 +4096,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
set_cr4_guest_host_mask(vmx);
 
-   if (vmx_xsaves_supported())
+   if (vmx->secondary_exec_control & SECONDARY_EXEC_XSAVES)
vmcs_write64(XSS_EXIT_BITMAP, VMX_XSS_EXIT_BITMAP);
 
if (enable_pml) {
-- 
2.20.1
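
As a rough illustration of the difference between the old and new checks,
here is a small userspace model. It is only a sketch: struct toy_vcpu,
host_xsaves_supported and both helper functions are made-up names for this
example, not KVM code, and the convention that a nonzero return means the
MSR access is refused is taken from the "return 1" lines in the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vCPU CPUID state. */
struct toy_vcpu {
	bool guest_has_xsave;
	bool guest_has_xsaves;
};

/* Host-wide capability, fixed once at setup time (models vmcs_config). */
static bool host_xsaves_supported = true;

/* Old check: the answer is the same for every guest on this host. */
static int old_xss_access(const struct toy_vcpu *v)
{
	(void)v;
	return host_xsaves_supported ? 0 : 1;
}

/* New check: refuse (return 1) unless the guest sees both features. */
static int new_xss_access(const struct toy_vcpu *v)
{
	if (!v->guest_has_xsave || !v->guest_has_xsaves)
		return 1;
	return 0;
}
```

With a guest whose CPUID hides XSAVES, the old check still allows the
access on a capable host, while the new one refuses it.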

[PATCH RESEND 2/8] sh: Start fallback of top-down mmap at mm->mmap_base

In case of mmap failure in top-down mode, there is no need to go through
the whole address space again for the bottom-up fallback: the goal of this
fallback is to find, as a last resort, space between the top-down mmap base
and the stack, which is the only place not covered by the top-down mmap.

Signed-off-by: Alexandre Ghiti 
---
 arch/sh/mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c
index 6a1a1297baae..4c7da92473dd 100644
--- a/arch/sh/mm/mmap.c
+++ b/arch/sh/mm/mmap.c
@@ -135,7 +135,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
if (addr & ~PAGE_MASK) {
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
-   info.low_limit = TASK_UNMAPPED_BASE;
+   info.low_limit = mm->mmap_base;
info.high_limit = TASK_SIZE;
addr = vm_unmapped_area(&info);
}
-- 
2.20.1
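
The range argument in the commit message can be sketched with a toy model
of the address space. Everything below is an assumption made for
illustration: struct toy_layout, the helper names and the addresses used
with them are not kernel definitions. The model only shows that a
bottom-up fallback starting at TASK_UNMAPPED_BASE rescans the part below
mmap_base that the top-down pass already covered.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy layout; values are illustrative, not kernel constants. */
struct toy_layout {
	uint64_t task_unmapped_base; /* legacy bottom-up start */
	uint64_t mmap_base;          /* upper bound of the top-down search */
	uint64_t task_size;          /* end of the user address space */
};

/* Size of the range a bottom-up fallback scans when it starts at low_limit. */
static uint64_t fallback_window(const struct toy_layout *l, uint64_t low_limit)
{
	return l->task_size - low_limit;
}

/* Bytes rescanned for nothing when the fallback starts at the legacy base. */
static uint64_t redundant_scan(const struct toy_layout *l)
{
	return fallback_window(l, l->task_unmapped_base) -
	       fallback_window(l, l->mmap_base);
}
```

In this model the redundant part is simply mmap_base minus the legacy
base, which is the range the patch stops searching.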

[PATCH RESEND 1/8] s390: Start fallback of top-down mmap at mm->mmap_base

In case of mmap failure in top-down mode, there is no need to go through
the whole address space again for the bottom-up fallback: the goal of this
fallback is to find, as a last resort, space between the top-down mmap base
and the stack, which is the only place not covered by the top-down mmap.

Signed-off-by: Alexandre Ghiti 
---
 arch/s390/mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index cbc718ba6d78..4a222969843b 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -166,7 +166,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const 
unsigned long addr0,
if (addr & ~PAGE_MASK) {
VM_BUG_ON(addr != -ENOMEM);
info.flags = 0;
-   info.low_limit = TASK_UNMAPPED_BASE;
+   info.low_limit = mm->mmap_base;
info.high_limit = TASK_SIZE;
addr = vm_unmapped_area(&info);
if (addr & ~PAGE_MASK)
-- 
2.20.1

[PATCH 1/1] staging: media: fix style problem

2019-06-19 Thread Aliasgar Surti

From: Aliasgar Surti 

checkpatch reported "WARNING: line over 80 characters".
This patch fixes the warning for file davinci_vpfe/dm365_isif.c

Signed-off-by: Aliasgar Surti 
---
 drivers/staging/media/davinci_vpfe/dm365_isif.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/media/davinci_vpfe/dm365_isif.c 
b/drivers/staging/media/davinci_vpfe/dm365_isif.c
index 46fd818..12bdf91 100644
--- a/drivers/staging/media/davinci_vpfe/dm365_isif.c
+++ b/drivers/staging/media/davinci_vpfe/dm365_isif.c
@@ -532,7 +532,8 @@ static int isif_validate_dfc_params(const struct 
vpfe_isif_dfc *dfc)
 #define DM365_ISIF_MAX_CLVSV   0x1fff
 #define DM365_ISIF_MAX_HEIGHT_BLACK_REGION 0x1fff
 
-static int isif_validate_bclamp_params(const struct vpfe_isif_black_clamp 
*bclamp)
+static int isif_validate_bclamp_params(const struct vpfe_isif_black_clamp
+  *bclamp)
 {
int err = -EINVAL;
 
@@ -593,7 +594,8 @@ isif_validate_raw_params(const struct vpfe_isif_raw_config 
*params)
return isif_validate_bclamp_params(&params->bclamp);
 }
 
-static int isif_set_params(struct v4l2_subdev *sd, const struct 
vpfe_isif_raw_config *params)
+static int isif_set_params(struct v4l2_subdev *sd,
+  const struct vpfe_isif_raw_config *params)
 {
struct vpfe_isif_device *isif = v4l2_get_subdevdata(sd);
int ret = -EINVAL;
-- 
2.7.4

Re: [PATCH v1 1/4] mm: introduce MADV_COLD

2019-06-19 Thread Minchan Kim

On Wed, Jun 19, 2019 at 01:13:40PM -0400, Joel Fernandes wrote:
< snip >

Ccing Vladimir

> > > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > > +   unsigned long end, struct mm_walk *walk)
> > > > > > +{
> > > > > > +   pte_t *orig_pte, *pte, ptent;
> > > > > > +   spinlock_t *ptl;
> > > > > > +   struct page *page;
> > > > > > +   struct vm_area_struct *vma = walk->vma;
> > > > > > +   unsigned long next;
> > > > > > +
> > > > > > +   next = pmd_addr_end(addr, end);
> > > > > > +   if (pmd_trans_huge(*pmd)) {
> > > > > > +   ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > > +   if (!ptl)
> > > > > > +   return 0;
> > > > > > +
> > > > > > +   if (is_huge_zero_pmd(*pmd))
> > > > > > +   goto huge_unlock;
> > > > > > +
> > > > > > +   page = pmd_page(*pmd);
> > > > > > +   if (page_mapcount(page) > 1)
> > > > > > +   goto huge_unlock;
> > > > > > +
> > > > > > +   if (next - addr != HPAGE_PMD_SIZE) {
> > > > > > +   int err;
> > > > > > +
> > > > > > +   get_page(page);
> > > > > > +   spin_unlock(ptl);
> > > > > > +   lock_page(page);
> > > > > > +   err = split_huge_page(page);
> > > > > > +   unlock_page(page);
> > > > > > +   put_page(page);
> > > > > > +   if (!err)
> > > > > > +   goto regular_page;
> > > > > > +   return 0;
> > > > > > +   }
> > > > > > +
> > > > > > +   pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > > +   deactivate_page(page);
> > > > > > +huge_unlock:
> > > > > > +   spin_unlock(ptl);
> > > > > > +   return 0;
> > > > > > +   }
> > > > > > +
> > > > > > +   if (pmd_trans_unstable(pmd))
> > > > > > +   return 0;
> > > > > > +
> > > > > > +regular_page:
> > > > > > +   orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > +   for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > > +   ptent = *pte;
> > > > > > +
> > > > > > +   if (pte_none(ptent))
> > > > > > +   continue;
> > > > > > +
> > > > > > +   if (!pte_present(ptent))
> > > > > > +   continue;
> > > > > > +
> > > > > > +   page = vm_normal_page(vma, addr, ptent);
> > > > > > +   if (!page)
> > > > > > +   continue;
> > > > > > +
> > > > > > +   if (page_mapcount(page) > 1)
> > > > > > +   continue;
> > > > > > +
> > > > > > +   ptep_test_and_clear_young(vma, addr, pte);
> > > > > 
> > > > > Wondering here how it interacts with idle page tracking. Here since 
> > > > > young
> > > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > > page_idle_clear_pte_refs_one() will not be able to clear the 
> > > > > page-idle flag
> > > > > if it was previously set since it does not know any more that a page 
> > > > > was
> > > > > actively referenced.
> > > > 
> > > > ptep_test_and_clear_young doesn't change PG_idle/young so idle page 
> > > > tracking
> > > > doesn't affect.
> > 
> > You said *young flag* in the comment, which made me confused. I thought you 
> > meant
> > PG_young flag but you mean PTE access bit.
> > 
> > > 
> > > Clearing of the young bit in the PTE does affect idle tracking.
> > > 
> > > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this 
> > > bit.
> > > 
> > > > > bit was previously set, just so that page-idle tracking works 
> > > > > smoothly when
> > > > > this hint is concurrently applied?
> > > > 
> > > > deactivate_page will remove PG_young bit so that the page will be 
> > > > reclaimed.
> > > > Do I miss your point?
> > > 
> > > Say a process had accessed PTE bit not set, then idle tracking is run and 
> > > PG_Idle
> > > is set. Now the page is accessed from userspace thus setting the accessed 
> > > PTE
> > > bit.  Now a remote process passes this process_madvise cold hint (I know 
> > > your
> > > current series does not support remote process, but I am saying for future
> > > when you post this). Because you cleared the PTE accessed bit through the
> > > hint, idle tracking no longer will know that the page is referenced and 
> > > the
> > > user gets confused because accessed page appears to be idle.
> > 
> > Right.
> > 
> > > 
> > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > don't need to do anything.
> > 
> > I'm not sure. What does it make MADV_COLD special?
> > How about MADV_FREE|MADV_DONTNEED?
> > Why don't they clear PG_Idle if pte was young at tearing down pte? 
> 
> Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be 
> fixed then?

Not sure. If you want it, maybe you need to fix every pte
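
The scenario Joel describes can be replayed with a very simplified state
model. This is a sketch under assumptions: struct toy_page and the toy_*
helpers are invented for this example and only mimic the PTE accessed bit
and PG_idle; they are not the kernel's idle tracking implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool pte_accessed; /* models the PTE young/accessed bit */
	bool pg_idle;      /* models PG_idle */
};

/* Idle tracking marks the page idle and clears the accessed bit. */
static void toy_mark_idle(struct toy_page *p)
{
	p->pg_idle = true;
	p->pte_accessed = false;
}

/* A userspace access sets the accessed bit. */
static void toy_touch(struct toy_page *p)
{
	p->pte_accessed = true;
}

/* Cold hint as discussed: only the accessed bit is cleared. */
static void toy_cold_hint(struct toy_page *p)
{
	p->pte_accessed = false;
}

/* Suggested variant: fold the accessed bit into PG_idle before clearing. */
static void toy_cold_hint_fixed(struct toy_page *p)
{
	if (p->pte_accessed)
		p->pg_idle = false;
	p->pte_accessed = false;
}

/* Idle tracking read: a set accessed bit clears PG_idle first. */
static bool toy_reads_idle(struct toy_page *p)
{
	if (p->pte_accessed) {
		p->pg_idle = false;
		p->pte_accessed = false;
	}
	return p->pg_idle;
}
```

Running mark idle, touch, cold hint and then a read reports the page as
idle even though it was touched; the variant reports it as not idle.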

Re: [PATCH V9] i2c: tegra: remove BUG() macro

2019-06-19 Thread Bitan Biswas





On 6/18/19 4:09 AM, Bitan Biswas wrote:

The use of the BUG() macro is generally discouraged in the kernel, unless
the problem results in physical damage or loss of data.
This patch removes unnecessary BUG() macros and replaces the rest
with warnings.

Signed-off-by: Bitan Biswas 
---
  drivers/i2c/busses/i2c-tegra.c | 47 +++---
  1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/i2c/busses/i2c-tegra.c b/drivers/i2c/busses/i2c-tegra.c
index 4dfb4c1..e9ff96d 100644
--- a/drivers/i2c/busses/i2c-tegra.c
+++ b/drivers/i2c/busses/i2c-tegra.c
@@ -73,6 +73,7 @@
  #define I2C_ERR_NO_ACK            BIT(0)
  #define I2C_ERR_ARBITRATION_LOST  BIT(1)
  #define I2C_ERR_UNKNOWN_INTERRUPT BIT(2)
+#define I2C_ERR_RX_BUFFER_OVERFLOW BIT(3)
  
  #define PACKET_HEADER0_HEADER_SIZE_SHIFT  28
  #define PACKET_HEADER0_PACKET_ID_SHIFT    16
@@ -489,6 +490,13 @@ static int tegra_i2c_empty_rx_fifo(struct tegra_i2c_dev 
*i2c_dev)
size_t buf_remaining = i2c_dev->msg_buf_remaining;
int words_to_transfer;
  
+	/*

+* Catch overflow due to message fully sent
+* before the check for RX FIFO availability.
+*/
+   if (WARN_ON_ONCE(!(i2c_dev->msg_buf_remaining)))
+   return -EINVAL;
+
if (i2c_dev->hw->has_mst_fifo) {
val = i2c_readl(i2c_dev, I2C_MST_FIFO_STATUS);
rx_fifo_avail = (val & I2C_MST_FIFO_STATUS_RX_MASK) >>
@@ -515,7 +523,11 @@ static int tegra_i2c_empty_rx_fifo(struct tegra_i2c_dev 
*i2c_dev)
 * prevent overwriting past the end of buf
 */
if (rx_fifo_avail > 0 && buf_remaining > 0) {
-   BUG_ON(buf_remaining > 3);
+   /*
+* buf_remaining > 3 check not needed as rx_fifo_avail == 0
+* when (words_to_transfer was > rx_fifo_avail) earlier
+* in this function.
+*/
val = i2c_readl(i2c_dev, I2C_RX_FIFO);
val = cpu_to_le32(val);
memcpy(buf, &val, buf_remaining);
@@ -523,7 +535,10 @@ static int tegra_i2c_empty_rx_fifo(struct tegra_i2c_dev 
*i2c_dev)
rx_fifo_avail--;
}
  
-	BUG_ON(rx_fifo_avail > 0 && buf_remaining > 0);

+   /* RX FIFO must be drained, otherwise it's an Overflow case. */
+   if (WARN_ON_ONCE(rx_fifo_avail))
+   return -EINVAL;
+
i2c_dev->msg_buf_remaining = buf_remaining;
i2c_dev->msg_buf = buf;
  
@@ -581,7 +596,11 @@ static int tegra_i2c_fill_tx_fifo(struct tegra_i2c_dev *i2c_dev)

 * boundary and fault.
 */
if (tx_fifo_avail > 0 && buf_remaining > 0) {
-   BUG_ON(buf_remaining > 3);
+   /*
+* buf_remaining > 3 check not needed as tx_fifo_avail == 0
+* when (words_to_transfer was > tx_fifo_avail) earlier
+* in this function for non-zero words_to_transfer.
+*/
memcpy(&val, buf, buf_remaining);
val = le32_to_cpu(val);
  
@@ -847,10 +866,15 @@ static irqreturn_t tegra_i2c_isr(int irq, void *dev_id)
  
  	if (!i2c_dev->is_curr_dma_xfer) {

if (i2c_dev->msg_read && (status & I2C_INT_RX_FIFO_DATA_REQ)) {
-   if (i2c_dev->msg_buf_remaining)
-   tegra_i2c_empty_rx_fifo(i2c_dev);
-   else
-   BUG();
+   if (tegra_i2c_empty_rx_fifo(i2c_dev)) {
+   /*
+* Overflow error condition: message fully sent,
+* with no XFER_COMPLETE interrupt but hardware
+* asks to transfer more.
+*/
+   i2c_dev->msg_err |= I2C_ERR_RX_BUFFER_OVERFLOW;
+   goto err;
+   }
}
  
  		if (!i2c_dev->msg_read && (status & I2C_INT_TX_FIFO_DATA_REQ)) {

@@ -876,7 +900,14 @@ static irqreturn_t tegra_i2c_isr(int irq, void *dev_id)
if (status & I2C_INT_PACKET_XFER_COMPLETE) {
if (i2c_dev->is_curr_dma_xfer)
i2c_dev->msg_buf_remaining = 0;
-   BUG_ON(i2c_dev->msg_buf_remaining);
+   /*
+* Underflow error condition: XFER_COMPLETE before message
+* fully sent.
+*/
+   if (WARN_ON_ONCE(i2c_dev->msg_buf_remaining)) {
+   i2c_dev->msg_err |= I2C_ERR_UNKNOWN_INTERRUPT;
+   goto err;
+   }
complete(&i2c_dev->msg_complete);
}
goto done;



Please get back if there are any further comments regarding this patch.

-regards,
 Bitan

Re: [PATCH] media: Clarify the meaning of file descriptors in VIDIOC_DQBUF

2019-06-19 Thread Alexandre Courbot

On Wed, Jun 12, 2019 at 6:36 PM Tomasz Figa  wrote:
>
> When the application calls VIDIOC_DQBUF with the DMABUF memory type, the
> v4l2_buffer structure (or v4l2_plane structures) are filled with DMA-buf
> file descriptors. However, the current documentation does not explain
> whether those are new file descriptors referring to the same DMA-bufs or
> just the same integers as passed to VIDIOC_QBUF back in time. Clarify
> the documentation that it's the latter.
>
> Signed-off-by: Tomasz Figa 

That's a welcome precision indeed.

Reviewed-by: Alexandre Courbot 

> ---
>  Documentation/media/uapi/v4l/vidioc-qbuf.rst | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/Documentation/media/uapi/v4l/vidioc-qbuf.rst 
> b/Documentation/media/uapi/v4l/vidioc-qbuf.rst
> index dbf7b445a27b..407302d80684 100644
> --- a/Documentation/media/uapi/v4l/vidioc-qbuf.rst
> +++ b/Documentation/media/uapi/v4l/vidioc-qbuf.rst
> @@ -139,6 +139,14 @@ may continue as normal, but should be aware that data in 
> the dequeued
>  buffer might be corrupted. When using the multi-planar API, the planes
>  array must be passed in as well.
>
> +If the application sets the ``memory`` field to ``V4L2_MEMORY_DMABUF`` to
> +dequeue a :ref:`DMABUF ` buffer, the driver fills the ``m.fd`` field
> +with a file descriptor numerically the same as the one given to 
> ``VIDIOC_QBUF``
> +when the buffer was enqueued. No new file descriptor is created at dequeue 
> time
> +and the value is only for the application convenience. When the multi-planar
> +API is used the ``m.fd`` fields of the passed array of struct
> +:c:type:`v4l2_plane` are filled instead.
> +
>  By default ``VIDIOC_DQBUF`` blocks when no buffer is in the outgoing
>  queue. When the ``O_NONBLOCK`` flag was given to the
>  :ref:`open() ` function, ``VIDIOC_DQBUF`` returns
> --
> 2.22.0.rc2.383.gf4fbbf30c2-goog
>
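
On the application side, the clarified behaviour means the dequeued fd
can be used as a plain lookup key. The following sketch assumes a
hypothetical bookkeeping table (struct toy_buffer, toy_on_dequeue); it
does not call any V4L2 ioctl and only shows the matching step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application-side record of a buffer queued with a DMA-buf fd. */
struct toy_buffer {
	int dmabuf_fd; /* fd the application passed at enqueue time */
	int in_flight; /* nonzero while the driver owns the buffer */
};

/*
 * Look up the record matching the fd reported at dequeue time.
 * Returns the index, or -1 if the fd is unknown. The fd is not closed
 * here: it is the application's original descriptor, not a new one.
 */
static int toy_on_dequeue(struct toy_buffer *bufs, size_t n, int reported_fd)
{
	for (size_t i = 0; i < n; i++) {
		if (bufs[i].in_flight && bufs[i].dmabuf_fd == reported_fd) {
			bufs[i].in_flight = 0;
			return (int)i;
		}
	}
	return -1;
}
```

Since no new descriptor is created at dequeue time, closing the reported
fd would close the application's own DMA-buf handle.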

Re: [PATCH 05/13] vfs: don't parse "silent" option

2019-06-19 Thread Ian Kent

On Wed, 2019-06-19 at 14:30 +0200, Miklos Szeredi wrote:
> While this is a standard option as documented in mount(8), it is ignored by
> most filesystems.  So reject, unless filesystem explicitly wants to handle
> it.
> 
> The exception is unconverted filesystems, where it is unknown if the
> filesystem handles this or not.
> 
> Any implementation, such as mount(8) that needs to parse this option
> without failing should simply ignore the return value from fsconfig().

In theory this is fine but every time someone has attempted
to change the handling of this in the past autofs has had
problems so I'm a bit wary of the change.

It was originally meant to tell the file system to ignore
invalid options such as could be found in automount maps that
are used with multiple OS implementations that have differences
in their options.

That was, IIRC, primarily NFS although NFS should handle most
(if not all of those) cases these days.

Nevertheless I'm a bit nervous about it, ;)

> 
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/fs_context.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 49636e541293..c26b353aa858 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -51,7 +51,6 @@ static const struct constant_table common_clear_sb_flag[] =
> {
>   { "nolazytime", SB_LAZYTIME },
>   { "nomand", SB_MANDLOCK },
>   { "rw", SB_RDONLY },
> - { "silent", SB_SILENT },
>  };
>  
>  /*
> @@ -535,6 +534,9 @@ static int legacy_parse_param(struct fs_context *fc,
> struct fs_parameter *param)
>   if (ret != -ENOPARAM)
>   return ret;
>  
> + if (strcmp(param->key, "silent") == 0)
> + fc->sb_flags |= SB_SILENT;
> +
>   if (strcmp(param->key, "source") == 0) {
>   if (param->type != fs_value_is_string)
>   return invalf(fc, "VFS: Legacy: Non-string source");

Re: [PATCH][bpf] bpf: verifier: add break statement in switch

2019-06-19 Thread Alexei Starovoitov

On Wed, Jun 19, 2019 at 9:02 AM Gustavo A. R. Silva
 wrote:
>
> Notice that in this case, it's much clearer to explicitly add a break
> rather than letting the code to fall through. It also avoid potential
> future fall-through warnings[1].
>
> This patch is part of the ongoing efforts to enable
> -Wimplicit-fallthrough.
>
> [1] https://lore.kernel.org/patchwork/patch/1087056/
>
> Signed-off-by: Gustavo A. R. Silva 

This type of change is not suitable for the bpf tree.
Please submit both as a single patch to bpf-next.

linux-next: Signed-off-by missing for commit in the afs tree

Hi David,

Commit

  0b8f4f05f41a ("afs: Add some callback tracepoints")

is missing a Signed-off-by from its author and committer.

-- 
Cheers,
Stephen Rothwell



Re: adding some trees to linux-next

Hi David,

On Wed, 19 Jun 2019 16:09:01 +0100 David Howells  wrote:
>
> Could you add my keys-next and afs-next branches to linux-next?  They can be
> found here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git#keys-next
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git#afs-next

Added from today.

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgement of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
 * submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
 * posted to the relevant mailing list,
 * reviewed by you (or another maintainer of your subsystem tree),
 * successfully unit tested, and 
 * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
s...@canb.auug.org.au


Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

2019-06-19 Thread Alex Williamson

On Sat,  8 Jun 2019 21:21:11 +0800
Liu Yi L  wrote:

> This patch adds sample driver named vfio-mdev-pci. It is to wrap
> a PCI device as a mediated device. For a pci device, once bound
> to vfio-mdev-pci driver, user space access of this device will
> go through vfio mdev framework. The usage of the device follows
> mdev management method. e.g. user should create a mdev before
> exposing the device to user-space.
> 
> Benefit of this new driver would be acting as a sample driver
> for recent changes from "vfio/mdev: IOMMU aware mediated device"
> patchset. Also it could be a good experiment driver for future
> device specific mdev migration support.
> 
> To use this driver:
> a) build and load vfio-mdev-pci.ko module
>execute "make menuconfig" and config CONFIG_SAMPLE_VFIO_MDEV_PCI
>then load it with following command
>> sudo modprobe vfio
>> sudo modprobe vfio-pci
>> sudo insmod drivers/vfio/pci/vfio-mdev-pci.ko  
> 
> b) unbind original device driver
>e.g. use following command to unbind its original driver
>> echo $dev_bdf > /sys/bus/pci/devices/$dev_bdf/driver/unbind  
> 
> c) bind vfio-mdev-pci driver to the physical device
>> echo $vend_id $dev_id > /sys/bus/pci/drivers/vfio-mdev-pci/new_id  
> 
> d) check the supported mdev instances
>> ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/  
>  vfio-mdev-pci-type1
>> ls /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\  
>  vfio-mdev-pci-type1/
>  available_instances  create  device_api  devices  name

I think the static type name here is a problem (and why does it
include "type1"?).  We generally consider that a type defines a
software compatible mdev, but in this case any PCI device wrapped in
vfio-mdev-pci gets the same mdev type.  This is only a sample driver,
but that's a bad precedent.  I've taken a stab at fixing this in the
patch below, using the PCI vendor ID, device ID, subsystem vendor ID,
subsystem device ID, class code, and revision to try to make the type
as specific to the physical device assigned as we can through PCI.

> 
> e)  create mdev on this physical device (only 1 instance)
>> echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1003" > \  
>  /sys/bus/pci/devices/$dev_bdf/mdev_supported_types/\
>  vfio-mdev-pci-type1/create

Whoops, available_instances always reports 1 and it doesn't appear that
the create function prevents additional mdevs.  Also addressed in the
patch below.

> f) passthru the mdev to guest
>add the following line in Qemu boot command
>-device vfio-pci,\
> sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003
> 
> g) destroy mdev
>> echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1003/\  
>  remove
> 

I also found that unbinding the parent device doesn't unregister with
mdev, so it cannot be bound again, also fixed below.

However, the patch below just makes the mdev interface behave
correctly, I can't make it work on my system because commit
7bd50f0cd2fd ("vfio/type1: Add domain at(de)taching group helpers")
used iommu_attach_device() rather than iommu_attach_group() for non-aux
mdev iommu_device.  Is there a requirement that the mdev parent device
is in a singleton iommu group?  If this is a simplification, then
vfio-mdev-pci should not bind to devices where this is violated since
there's no way to use the device.  Can we support it though?

If I have two devices in the same group and bind them both to
vfio-mdev-pci, I end up with three groups, one for each mdev device and
the original physical device group.  vfio.c works with the mdev groups
and will try to match both groups to the container.  vfio_iommu_type1.c
also works with the mdev groups, except for the point where we actually
try to attach a group to a domain, which is the only window where we use
the iommu_device rather than the provided group, but we don't record
that anywhere.  Should struct vfio_group have a pointer to a reference
counted object that tracks the actual iommu_group attached, such that
we can determine that the group is already attached to the domain and
not try to attach again?  Ideally I'd be able to bind one device to
vfio-pci, the other to vfio-mdev-pci, and be able to use them both
within the same container.  It seems like this should be possible, it's
the same effective iommu configuration as if they were both bound to
vfio-pci.  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_mdev_pci.c b/drivers/vfio/pci/vfio_mdev_pci.c
index 07c8067b3f73..09143d3e5473 100644
--- a/drivers/vfio/pci/vfio_mdev_pci.c
+++ b/drivers/vfio/pci/vfio_mdev_pci.c
@@ -65,18 +65,22 @@ MODULE_PARM_DESC(disable_idle_d3,

 static struct pci_driver vfio_mdev_pci_driver;

-static ssize_t
-name_show(struct kobject *kobj, struct device *dev, char *buf)
-{
-   return sprintf(buf, "%s-type1\n", dev_name(dev));
-}
-
-MDEV_TYPE_ATTR_RO(name);
+struct vfio_mdev_pci_device {
+   struct vfio_pci_device vdev;
+   struct mdev_parent_ops ops;
+

Re: [PATCH] net: fddi: skfp: remove generic PCI defines from skfbi.h

2019-06-19 Thread Puranjay Mohan

On Wed, Jun 19, 2019 at 02:10:22PM -0500, Bjorn Helgaas wrote:
> On Wed, Jun 19, 2019 at 12:48 PM Puranjay Mohan  wrote:
> >
> > skfbi.h defines its own copies of PCI_COMMAND, PCI_STATUS, etc.
> > remove them in favor of the generic definitions in
> > include/uapi/linux/pci_regs.h
> 
> 1) Since you're sending several related patches, send them as a
> "series" with a cover letter, e.g.,
> 
>   [PATCH v2 0/2] Use PCI generic definitions instead of private duplicates
>   [PATCH v2 1/2] Include generic PCI definitions
>   [PATCH v2 2/2] Remove unused private PCI definitions
> 
> Patches 1/2 and 2/2 should be replies to the 0/2 cover letter.  "git
> send-email" will do this for you if you figure out the right options.
> 
> 2) Make sure all your subject lines match.  One started with "Include"
> and the other with "remove".  They should both be capitalized.
> 
> 3) Start sentences with a capital letter, i.e., "Remove them" above.
> 
> 4) This commit log needs to explicitly say that you're removing
> *unused* symbols.  Since they're unused, you don't even need to refer
> to pci_regs.h.
> 
> 5) "git grep PCI_ drivers/net/fddi/skfp" says there are many more
> unused PCI symbols than just the ones below.  I would just remove them
> all at once.
> 
> 6) Obviously you should compile this to make sure it builds.  It must
> build cleanly after every patch, not just at the end.  I assume you've
> done this already.
>
Yes, I build the driver after every change and I do it again before
sending the patch to be sure that it works.
> 7) Please cc: linux-...@vger.kernel.org since you're making PCI-related 
> changes.
> 
sure.
> > Signed-off-by: Puranjay Mohan 
> > ---
> >  drivers/net/fddi/skfp/h/skfbi.h | 23 ---
> >  1 file changed, 23 deletions(-)
> >
> > diff --git a/drivers/net/fddi/skfp/h/skfbi.h 
> > b/drivers/net/fddi/skfp/h/skfbi.h
> > index 89557457b352..ed144a8e78d1 100644
> > --- a/drivers/net/fddi/skfp/h/skfbi.h
> > +++ b/drivers/net/fddi/skfp/h/skfbi.h
> > @@ -27,29 +27,6 @@
> >  /*
> >   * Configuration Space header
> >   */
> > -#definePCI_VENDOR_ID   0x00/* 16 bit   Vendor ID */
> > -#definePCI_DEVICE_ID   0x02/* 16 bit   Device ID */
> > -#definePCI_COMMAND 0x04/* 16 bit   Command */
> > -#definePCI_STATUS  0x06/* 16 bit   Status */
> > -#definePCI_REV_ID  0x08/*  8 bit   Revision ID */
> > -#definePCI_CLASS_CODE  0x09/* 24 bit   Class Code */
> > -#definePCI_CACHE_LSZ   0x0c/*  8 bit   Cache Line Size */
> > -#definePCI_LAT_TIM 0x0d/*  8 bit   Latency Timer */
> > -#definePCI_HEADER_T0x0e/*  8 bit   Header Type */
> > -#definePCI_BIST0x0f/*  8 bit   Built-in selftest */
> > -#definePCI_BASE_1ST0x10/* 32 bit   1st Base address */
> > -#definePCI_BASE_2ND0x14/* 32 bit   2nd Base address */
> > -/* Byte 18..2b:Reserved */
> > -#definePCI_SUB_VID 0x2c/* 16 bit   Subsystem Vendor ID 
> > */
> > -#definePCI_SUB_ID  0x2e/* 16 bit   Subsystem ID */
> > -#definePCI_BASE_ROM0x30/* 32 bit   Expansion ROM Base 
> > Address */
> > -/* Byte 34..33:Reserved */
> > -#define PCI_CAP_PTR0x34/*  8 bit (ML)  Capabilities Ptr */
> > -/* Byte 35..3b:Reserved */
> > -#definePCI_IRQ_LINE0x3c/*  8 bit   Interrupt Line */
> > -#definePCI_IRQ_PIN 0x3d/*  8 bit   Interrupt Pin */
> > -#definePCI_MIN_GNT 0x3e/*  8 bit   Min_Gnt */
> > -#definePCI_MAX_LAT 0x3f/*  8 bit   Max_Lat */
> >  /* Device Dependent Region */
> >  #definePCI_OUR_REG 0x40/* 32 bit (DV)  Our Register */
> >  #definePCI_OUR_REG_1   0x40/* 32 bit (ML)  Our Register 1 */
> > --
> > 2.21.0
> >

Thanks for the feedback!
I will send the patch series soon.

Thanks

--Puranjay

Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT

2019-06-19 Thread Minchan Kim

On Wed, Jun 19, 2019 at 03:24:50PM +0200, Michal Hocko wrote:
> On Mon 10-06-19 20:12:51, Minchan Kim wrote:
> [...]
> > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > +   unsigned long end, struct mm_walk *walk)
> 
> Again the same question about a potential code reuse...
> [...]
> > +regular_page:
> > +   tlb_change_page_size(tlb, PAGE_SIZE);
> > +   orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +   flush_tlb_batched_pending(mm);
> > +   arch_enter_lazy_mmu_mode();
> > +   for (; addr < end; pte++, addr += PAGE_SIZE) {
> > +   ptent = *pte;
> > +   if (!pte_present(ptent))
> > +   continue;
> > +
> > +   page = vm_normal_page(vma, addr, ptent);
> > +   if (!page)
> > +   continue;
> > +
> > +   if (isolate_lru_page(page))
> > +   continue;
> > +
> > +   isolated++;
> > +   if (pte_young(ptent)) {
> > +   ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +   tlb->fullmm);
> > +   ptent = pte_mkold(ptent);
> > +   set_pte_at(mm, addr, pte, ptent);
> > +   tlb_remove_tlb_entry(tlb, pte, addr);
> > +   }
> > +   ClearPageReferenced(page);
> > +   test_and_clear_page_young(page);
> > +   list_add(>lru, _list);
> > +   if (isolated >= SWAP_CLUSTER_MAX) {
> 
> Why do we need SWAP_CLUSTER_MAX batching? Especially when we need ...
> [...]

It aims to prevent an early OOM kill caused by isolating too many LRU
pages at once.

> 
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +   int nid = -1;
> > +   unsigned long nr_reclaimed = 0;
> > +   LIST_HEAD(node_page_list);
> > +   struct reclaim_stat dummy_stat;
> > +   struct scan_control sc = {
> > +   .gfp_mask = GFP_KERNEL,
> > +   .priority = DEF_PRIORITY,
> > +   .may_writepage = 1,
> > +   .may_unmap = 1,
> > +   .may_swap = 1,
> > +   };
> > +
> > +   while (!list_empty(page_list)) {
> > +   struct page *page;
> > +
> > +   page = lru_to_page(page_list);
> > +   if (nid == -1) {
> > +   nid = page_to_nid(page);
> > +   INIT_LIST_HEAD(&node_page_list);
> > +   }
> > +
> > +   if (nid == page_to_nid(page)) {
> > +   list_move(&page->lru, &node_page_list);
> > +   continue;
> > +   }
> > +
> > +   nr_reclaimed += shrink_page_list(&node_page_list,
> > +   NODE_DATA(nid),
> > +   &sc, 0,
> > +   &dummy_stat, false);
> 
> per-node batching in fact. Other than that nothing really jumped at me.
> Except for the shared page cache side channel timing aspect not being
> considered AFAICS. To be more specific. Pushing out a shared page cache
> is possible even now but this interface gives a much easier tool to
> evict shared state and perform all sorts of timing attacks. Unless I am
> missing something we should be doing something similar to mincore and
> ignore shared pages without a writeable access or at least document why
> we do not care.

I'm not sure I understand the side channel attack. As you mentioned, without this syscall,
1. they already can do that simply by memory hogging
2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?

[PATCH v2] ocxl: Allow contexts to be attached with a NULL mm

2019-06-19 Thread Alastair D'Silva

From: Alastair D'Silva 

If an OpenCAPI context is to be used directly by a kernel driver, there
may not be a suitable mm to use.

The patch makes the mm parameter to ocxl_context_attach optional.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/mm/book3s64/radix_tlb.c |  5 +
 drivers/misc/ocxl/context.c  |  9 ++---
 drivers/misc/ocxl/link.c | 28 
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index bb9835681315..ce8a77fae6a7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -666,6 +666,11 @@ EXPORT_SYMBOL(radix__flush_tlb_page);
 #define radix__flush_all_mm radix__local_flush_all_mm
 #endif /* CONFIG_SMP */
 
+/*
+ * If kernel TLBIs ever become local rather than global, then
+ * drivers/misc/ocxl/link.c:ocxl_link_add_pe will need some work, as it
+ * assumes kernel TLBIs are global.
+ */
 void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
_tlbie_pid(0, RIC_FLUSH_ALL);
diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
index bab9c9364184..994563a078eb 100644
--- a/drivers/misc/ocxl/context.c
+++ b/drivers/misc/ocxl/context.c
@@ -69,6 +69,7 @@ static void xsl_fault_error(void *data, u64 addr, u64 dsisr)
 int ocxl_context_attach(struct ocxl_context *ctx, u64 amr, struct mm_struct 
*mm)
 {
int rc;
+   unsigned long pidr = 0;
 
// Locks both status & tidr
mutex_lock(&ctx->status_mutex);
@@ -77,9 +78,11 @@ int ocxl_context_attach(struct ocxl_context *ctx, u64 amr, 
struct mm_struct *mm)
goto out;
}
 
-   rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid,
-   mm->context.id, ctx->tidr, amr, mm,
-   xsl_fault_error, ctx);
+   if (mm)
+   pidr = mm->context.id;
+
+   rc = ocxl_link_add_pe(ctx->afu->fn->link, ctx->pasid, pidr, ctx->tidr,
+ amr, mm, xsl_fault_error, ctx);
if (rc)
goto out;
 
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index cce5b0d64505..58d111afd9f6 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -224,6 +224,17 @@ static irqreturn_t xsl_fault_handler(int irq, void *data)
ack_irq(spa, ADDRESS_ERROR);
return IRQ_HANDLED;
}
+
+   if (!pe_data->mm) {
+   /*
+* translation fault from a kernel context - an OpenCAPI
+* device tried to access a bad kernel address
+*/
+   rcu_read_unlock();
+   pr_warn("Unresolved OpenCAPI xsl fault in kernel context\n");
+   ack_irq(spa, ADDRESS_ERROR);
+   return IRQ_HANDLED;
+   }
WARN_ON(pe_data->mm->context.id != pid);
 
if (mmget_not_zero(pe_data->mm)) {
@@ -523,7 +534,13 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 
pidr, u32 tidr,
pe->amr = cpu_to_be64(amr);
pe->software_state = cpu_to_be32(SPA_PE_VALID);
 
-   mm_context_add_copro(mm);
+   /*
+* For user contexts, register a copro so that TLBIs are seen
+* by the nest MMU. If we have a kernel context, TLBIs are
+* already global.
+*/
+   if (mm)
+   mm_context_add_copro(mm);
/*
 * Barrier is to make sure PE is visible in the SPA before it
 * is used by the device. It also helps with the global TLBI
@@ -546,7 +563,8 @@ int ocxl_link_add_pe(void *link_handle, int pasid, u32 
pidr, u32 tidr,
 * have a reference on mm_users. Incrementing mm_count solves
 * the problem.
 */
-   mmgrab(mm);
+   if (mm)
+   mmgrab(mm);
trace_ocxl_context_add(current->pid, spa->spa_mem, pasid, pidr, tidr);
 unlock:
mutex_unlock(&spa->spa_lock);
@@ -652,8 +670,10 @@ int ocxl_link_remove_pe(void *link_handle, int pasid)
if (!pe_data) {
WARN(1, "Couldn't find pe data when removing PE\n");
} else {
-   mm_context_remove_copro(pe_data->mm);
-   mmdrop(pe_data->mm);
+   if (pe_data->mm) {
+   mm_context_remove_copro(pe_data->mm);
+   mmdrop(pe_data->mm);
+   }
kfree_rcu(pe_data, rcu);
}
 unlock:
-- 
2.21.0
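
The NULL-mm handling in the patch above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the driver code: `struct mm_sketch`, `struct pe_sketch`, `pe_add()` and `pe_remove()` are hypothetical stand-ins for mm_struct, the PE data, ocxl_link_add_pe() and ocxl_link_remove_pe(), and a plain counter stands in for mmgrab()/mmdrop().

```c
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: a context id plus a reference count. */
struct mm_sketch {
	unsigned long context_id;
	int refcount;
};

/* Hypothetical stand-in for the per-PE data kept by the link. */
struct pe_sketch {
	struct mm_sketch *mm;	/* NULL for a kernel context */
	unsigned long pidr;
};

/* Attach a context: a kernel context (mm == NULL) gets pidr 0, no reference. */
static void pe_add(struct pe_sketch *pe, struct mm_sketch *mm)
{
	pe->mm = mm;
	pe->pidr = mm ? mm->context_id : 0;
	if (mm)
		mm->refcount++;		/* stands in for mmgrab() */
}

/* Detach a context: only drop the reference that pe_add() actually took. */
static void pe_remove(struct pe_sketch *pe)
{
	if (pe->mm)
		pe->mm->refcount--;	/* stands in for mmdrop() */
	pe->mm = NULL;
}
```

The point of guarding both sides with the same test is that the reference count stays balanced whether or not an mm was supplied.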

Re: [PATCH] ext4: remove redundant assignment to node

2019-06-19 Thread Theodore Ts'o

On Wed, Jun 19, 2019 at 10:00:06AM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Pointer 'node' is assigned a value that is never read, node is
> later overwritten when it re-assigned a different value inside
> the while-loop.  The assignment is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thanks.

- Ted

Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-19 Thread Zhangjs Jinshui



> On Jun 19, 2019, at 19:08, Jan Kara wrote:
> 
> On Mon 17-06-19 23:57:12, jinshui zhang wrote:
>> From: zhangjs 
>> 
>> If the task is unplugged when called, the inode_readahead_blks may not be 
>> merged, 
>> these will cause small pieces of io, It should be plugged.
>> 
>> Signed-off-by: zhangjs 
> 
> Out of curiosity, on which path do you see __ext4_get_inode_loc() being
> called without IO already plugged?
> 
> Otherwise the patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 
> 
>   Honza
> 
>> ---
>> fs/ext4/inode.c | 6 ++
>> 1 file changed, 6 insertions(+)
>> 
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c7f77c6..8fe046b 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4570,6 +4570,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
>>  struct buffer_head  *bh;
>>  struct super_block  *sb = inode->i_sb;
>>  ext4_fsblk_t block;
>> +struct blk_plug plug;
>>  int inodes_per_block, inode_offset;
>> 
>>  iloc->bh = NULL;
>> @@ -4654,6 +4655,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
>>  }
>> 
>> make_io:
>> +blk_start_plug(&plug);
>> +
>>  /*
>>   * If we need to do any I/O, try to pre-readahead extra
>>   * blocks from the inode table.
>> @@ -4688,6 +4691,9 @@ static int __ext4_get_inode_loc(struct inode *inode,
>>  get_bh(bh);
>>  bh->b_end_io = end_buffer_read_sync;
>>  submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
>> +
>> +blk_finish_plug(&plug);
>> +
>>  wait_on_buffer(bh);
>>  if (!buffer_uptodate(bh)) {
>>  EXT4_ERROR_INODE_BLOCK(inode, block,
>> -- 
>> 1.8.3.1
>> 
> -- 
> Jan Kara 
> SUSE Labs, CR

You can see it with blktrace:

  8,80  31   11 0.296373038 2885275  Q  RA 8279571464 + 8 []
  8,80  31   12 0.296374017 2885275  G  RA 8279571464 + 8 []
  8,80  31   13 0.296375468 2885275  I  RA 8279571464 + 8 []
  8,80  31   14 0.296382099  3886  D  RA 8279571464 + 8 [kworker/31:1H]
  8,80  31   15 0.296391907 2885275  Q  RA 8279571472 + 8 []
  8,80  31   16 0.296392275 2885275  G  RA 8279571472 + 8 []
  8,80  31   17 0.296393305 2885275  I  RA 8279571472 + 8 []
  8,80  31   18 0.296395844  3886  D  RA 8279571472 + 8 [kworker/31:1H]
  8,80  31   19 0.296399685 2885275  Q  RA 8279571480 + 8 []
  8,80  31   20 0.296400025 2885275  G  RA 8279571480 + 8 []
  8,80  31   21 0.296401232 2885275  I  RA 8279571480 + 8 []
  8,80  31   22 0.296403422  3886  D  RA 8279571480 + 8 [kworker/31:1H]
  8,80  31   23 0.296407375 2885275  Q  RA 8279571488 + 8 []
  8,80  31   24 0.296407721 2885275  G  RA 8279571488 + 8 []
  8,80  31   25 0.296408904 2885275  I  RA 8279571488 + 8 []
  8,80  31   26 0.296411127  3886  D  RA 8279571488 + 8 [kworker/31:1H]
  8,80  31   27 0.296414779 2885275  Q  RA 8279571496 + 8 []
  8,80  31   28 0.296415119 2885275  G  RA 8279571496 + 8 []
  8,80  31   29 0.296415744 2885275  I  RA 8279571496 + 8 []
  8,80  31   30 0.296417779  3886  D  RA 8279571496 + 8 [kworker/31:1H]

These RA I/Os were caused by ext4_inode_readahead_blks; none of them were
merged because of the unplugged state.
The backtrace below was captured with the systemtap ioblock.request probe,
filtered by "opf & 1 << 19":

 0x8136fb20 : generic_make_request+0x0/0x2f0 [kernel]
 0x8136fe7e : submit_bio+0x6e/0x130 [kernel]
 0x812971e6 : submit_bh_wbc+0x156/0x190 [kernel]
 0x81297bca : ll_rw_block+0x6a/0xb0 [kernel]
 0x81297cc0 : __breadahead+0x40/0x70 [kernel]
 0xa0392c9a : __ext4_get_inode_loc+0x37a/0x3d0 [ext4]
 0xa0396a6c : ext4_iget+0x8c/0xc00 [ext4]
 0xa03ad98a : ext4_lookup+0xca/0x1d0 [ext4]
 0x8126b814 : path_openat+0xcb4/0x1250 [kernel]
 0x8126dc41 : do_filp_open+0x91/0x100 [kernel]
 0x8125ad86 : do_sys_open+0x126/0x210 [kernel]
 0x81003864 : do_syscall_64+0x74/0x1a0 [kernel]
 0x81800081 : entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [kernel]

I have applied the patch on production servers, and it improves performance.
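
As a rough user-space illustration of what plugging gives the block layer a chance to do, the sketch below merges back-to-back read requests like the ones in the blktrace output above. It is not kernel code: `struct req` and `merge_contiguous()` are hypothetical names, and the real merging logic in the block layer is considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued read request: start sector and length. */
struct req {
	uint64_t sector;
	uint32_t nr_sectors;
};

/*
 * Merge back-to-back requests in place, roughly what a plugged queue has
 * the opportunity to do before dispatch. Assumes the input is sorted by
 * sector. Returns the number of requests left after merging.
 */
static size_t merge_contiguous(struct req *reqs, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;

	for (size_t i = 1; i < n; i++) {
		if (reqs[out].sector + reqs[out].nr_sectors == reqs[i].sector)
			reqs[out].nr_sectors += reqs[i].nr_sectors;
		else
			reqs[++out] = reqs[i];
	}
	return out + 1;
}
```

Fed the five 8-sector reads from the trace (8279571464 through 8279571496), this collapses them into a single 40-sector request, which is the kind of merge the unplugged path never gets to attempt.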

Re: [PATCH v2] perf cs-etm: Improve completeness for kernel address space

2019-06-19 Thread Leo Yan

Hi all,

On Thu, Jun 20, 2019 at 08:54:28AM +0800, Leo Yan wrote:

[...]

> diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
> index 51dd00f65709..cf5906d667aa 100644
> --- a/tools/perf/Makefile.config
> +++ b/tools/perf/Makefile.config
> @@ -418,6 +418,26 @@ ifdef CORESIGHT
>  endif
>  LDFLAGS += $(LIBOPENCSD_LDFLAGS)
>  EXTLIBS += $(OPENCSDLIBS)
> +ARM_PRE_START_SIZE := 0
> +ifeq ($(SRCARCH),arm64)
> +  # Extract info from lds:
> +  #  . = 0x)) - (((1)) << (48)) + 1) + (0)) + 
> (0x0800))) + (0x0800))) + 0x0008;
> +  # ARM_PRE_START_SIZE := (0x0800 + 0x0800 + 0x0008)
> +  ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({8}0x[0-9a-fA-F]+\){2}' \
> +$(srctree)/arch/arm64/kernel/vmlinux.lds | \
> +sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
> +awk -F' ' '{print "("$$6 "+"  $$7 "+" $$8")"}' 2>/dev/null)
> +endif
> +ifeq ($(SRCARCH),arm)
> +  # Extract info from lds:
> +  #   . = ((0xC000)) + 0x00208000;
> +  # ARM_PRE_START_SIZE := 0x00208000
> +  ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({2}0x[0-9a-fA-F]+\){2}' \
> +$(srctree)/arch/arm/kernel/vmlinux.lds | \
> +sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
> +awk -F' ' '{print "("$$2")"}' 2>/dev/null)
> +endif
> +CFLAGS += -DARM_PRE_START_SIZE="$(ARM_PRE_START_SIZE)"

I tested building perf with this patch; it is fragile and easily
triggers the following build warning (an error under -Werror):

  : error: "ARM_PRE_START_SIZE" redefined [-Werror]
  : note: this is the location of the previous definition

To get rid of this error, I need to change the macro definition as below:

  +CFLAGS += -DARM_PRE_START_SIZE=$(ARM_PRE_START_SIZE)

So I sent patch v3 to address this issue; please review patch v3
directly.  Sorry for the spam.

Thanks,
Leo Yan


>  $(call detected,CONFIG_LIBOPENCSD)
>  ifdef CSTRACE_RAW
>CFLAGS += -DCS_DEBUG_RAW
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> index 0c7776b51045..5fa0be3a3904 100644
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
> @@ -613,10 +613,27 @@ static void cs_etm__free(struct perf_session *session)
>  static u8 cs_etm__cpu_mode(struct cs_etm_queue *etmq, u64 address)
>  {
>   struct machine *machine;
> + u64 fixup_kernel_start = 0;
>  
>   machine = etmq->etm->machine;
>  
> - if (address >= etmq->etm->kernel_start) {
> + /*
> +  * Since arm and arm64 specify some memory regions prior to
> +  * 'kernel_start', kernel addresses can be less than 'kernel_start'.
> +  *
> +  * For arm architecture, the 16MB virtual memory space prior to
> +  * 'kernel_start' is allocated to device modules, a PMD table if
> +  * CONFIG_HIGHMEM is enabled and a PGD table.
> +  *
> +  * For arm64 architecture, the root PGD table, device module memory
> +  * region and BPF jit region are prior to 'kernel_start'.
> +  *
> +  * To reflect the complete kernel address space, compensate these
> +  * pre-defined regions for kernel start address.
> +  */
> + fixup_kernel_start = etmq->etm->kernel_start - ARM_PRE_START_SIZE;
> +
> + if (address >= fixup_kernel_start) {
>   if (machine__is_host(machine))
>   return PERF_RECORD_MISC_KERNEL;
>   else
> -- 
> 2.17.1
>

[PATCH v3] perf cs-etm: Improve completeness for kernel address space

2019-06-19 Thread Leo Yan

Arm and arm64 architectures reserve some memory regions prior to the
symbol '_stext', and these memory regions are later used by device
modules and the BPF JIT.  The current code fails to consider these
memory regions, so any address in them is taken as user space mode;
perf then cannot find the corresponding dso with the wrong CPU mode,
and we fail to generate samples for device module and BPF related
trace data.

This patch parses the linker scripts to get the size of the memory
prior to the start address and subtracts this size from
'etmq->etm->kernel_start', giving a fixed-up kernel start address
that covers the memory regions for device modules and BPF.  As a
result, cs_etm__cpu_mode() returns the right mode for these memory
regions and perf can successfully generate samples.

The reason for parsing the linker scripts is that the Arm architecture
changes the text offset depending on the platform, defining multiple
text offsets in $kernel/arch/arm/Makefile.  The offset is decided when
the kernel is built and the final value is expanded in the linker
script, so we can extract the value in use from the linker script.  We
parse the arm64 linker script in the same way.  If the linker script
cannot be found, the pre-start memory size is assumed to be zero, in
which case this patch changes nothing.
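
The address classification described above can be sketched as a small stand-alone function. This is an illustration only: `cpu_mode_sketch()`, `enum mode_sketch` and the `pre_start_size` parameter are hypothetical names (the patch itself uses the compile-time macro ARM_PRE_START_SIZE inside cs_etm__cpu_mode()), and the sketch assumes pre_start_size is never larger than kernel_start.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the PERF_RECORD_MISC_* CPU modes. */
enum mode_sketch { MODE_USER, MODE_KERNEL };

/*
 * Classify an address as kernel or user space. The kernel start address is
 * moved down by the size of the regions reserved before '_stext', so that
 * module and BPF JIT addresses count as kernel addresses.
 * Assumes pre_start_size <= kernel_start.
 */
static enum mode_sketch cpu_mode_sketch(uint64_t address,
					uint64_t kernel_start,
					uint64_t pre_start_size)
{
	uint64_t fixup_kernel_start = kernel_start - pre_start_size;

	return address >= fixup_kernel_start ? MODE_KERNEL : MODE_USER;
}
```

With a pre-start size of zero the function behaves exactly like the old comparison against kernel_start, which matches the fallback when no linker script is found.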

Below is detailed info for testing this patch:

- Build LLVM/Clang 8.0 or later version;

- Configure perf with ~/.perfconfig:

  root@debian:~# cat ~/.perfconfig
  # this file is auto-generated.
  [llvm]
  clang-path = /mnt/build/llvm-build/build/install/bin/clang
  kbuild-dir = /mnt/linux-kernel/linux-cs-dev/
  clang-opt = "-g"
  dump-obj = true

  [trace]
  show_zeros = yes
  show_duration = no
  no_inherit = yes
  show_timestamp = no
  show_arg_names = no
  args_alignment = 40
  show_prefix = yes

- Run 'perf trace' command with eBPF event:

  root@debian:~# perf trace -e string \
  -e $kernel/tools/perf/examples/bpf/augmented_raw_syscalls.c

- Read eBPF program memory mapping in kernel:

  root@debian:~# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
  root@debian:~# cat /proc/kallsyms | grep -E "bpf_prog_.+_sys_[enter|exit]"
  00086a84 t bpf_prog_f173133dc38ccf87_sys_enter  [bpf]
  00088618 t bpf_prog_c1bd85c092d6e4aa_sys_exit   [bpf]

- Launch any program that accesses the file system frequently, so that
  it hits the system call trace flow with the eBPF event;

- Capture CoreSight trace data with filtering eBPF program:

  root@debian:~# perf record -e cs_etm/@2007.etr/ \
  --filter 'filter 0x00086a84/0x800' -a sleep 5s

- Annotate for symbol 'bpf_prog_f173133dc38ccf87_sys_enter':

  root@debian:~# perf report
  Then select 'branches' samples and press 'a' to annotate symbol
  'bpf_prog_f173133dc38ccf87_sys_enter', press 'P' to print to the
  bpf_prog_f173133dc38ccf87_sys_enter.annotation file:

  root@debian:~# cat bpf_prog_f173133dc38ccf87_sys_enter.annotation

  bpf_prog_f173133dc38ccf87_sys_enter() bpf_prog_f173133dc38ccf87_sys_enter
  Event: branches

  Percent  int sys_enter(struct syscall_enter_args *args)
 stp  x29, x30, [sp, #-16]!

int key = 0;
 mov  x29, sp

   augmented_args = 
bpf_map_lookup_elem(_filename_map, );
 stp  x19, x20, [sp, #-16]!

   augmented_args = 
bpf_map_lookup_elem(_filename_map, );
 stp  x21, x22, [sp, #-16]!

 stp  x25, x26, [sp, #-16]!

return bpf_get_current_pid_tgid();
 mov  x25, sp

return bpf_get_current_pid_tgid();
 mov  x26, #0x0 // #0

 sub  sp, sp, #0x10

return bpf_map_lookup_elem(pids, ) != NULL;
 add  x19, x0, #0x0

 mov  x0, #0x0  // #0

 mov  x10, #0xfff8  // #-8

if (pid_filter__has(_filtered, getpid()))
 str  w0, [x25, x10]

probe_read(_args->args, sizeof(augmented_args->args), 
args);
 add  x1, x25, #0x0

probe_read(_args->args, sizeof(augmented_args->args), 
args);
 mov  x10, #0xfff8  // #-8

syscall = bpf_map_lookup_elem(, 
_args->args.syscall_nr);
 add  x1, x1, x10

syscall = bpf_map_lookup_elem(, 
_args->args.syscall_nr);
 mov  x0, #0x8009   // #-140694538682369

 movk x0, #0x6698, lsl #16

 movk x0, #0x3e00

 mov  x10, #0x1040  // #-61376

if (syscall == NULL || !syscall->enabled)
 movk x10, #0x1023, lsl #16

if (syscall == NULL || !syscall->enabled)
 movk x10, #0x0, lsl #32

Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-19 Thread Theodore Ts'o

On Mon, Jun 17, 2019 at 11:57:12PM +0800, jinshui zhang wrote:
> From: zhangjs 
> 
> If the task is unplugged when called, the inode_readahead_blks may not be 
> merged, 
> these will cause small pieces of io, It should be plugged.
> 
> Signed-off-by: zhangjs 

Thanks, applied.

I cleaned up the commit description a little, and I removed some of
the extra empty lines added by the patch.

Cheers,

- Ted

linux-next: manual merge of the apparmor tree with Linus' tree

Hi all,

Today's linux-next merge of the apparmor tree got a conflict in:

  security/apparmor/include/policy.h

between commit:

  23375b13f98c ("apparmor: fix PROFILE_MEDIATES for untrusted input")

from Linus' tree and commit:

  06c13f554a71 ("apparmor: re-introduce a variant of PROFILE_MEDIATES_SAFE")

from the apparmor tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc security/apparmor/include/policy.h
index b5b4b8190e65,9af2114e1bf0..
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@@ -213,14 -217,13 +213,22 @@@ static inline struct aa_profile *aa_get
return labels_profile(aa_get_newest_label(>label));
  }
  
 -#define PROFILE_MEDIATES(P, T)  ((P)->policy.start[(unsigned char) (T)])
 +static inline unsigned int PROFILE_MEDIATES(struct aa_profile *profile,
 +  unsigned char class)
 +{
 +  if (class <= AA_CLASS_LAST)
 +  return profile->policy.start[class];
 +  else
 +  return aa_dfa_match_len(profile->policy.dfa,
 +  profile->policy.start[0], &class, 1);
++}
++
+ /* safe version of POLICY_MEDIATES for full range input */
+ static inline unsigned int PROFILE_MEDIATES_SAFE(struct aa_profile *profile,
+unsigned char class)
+ {
+   return aa_dfa_match_len(profile->policy.dfa,
+   profile->policy.start[0], &class, 1);
  }
  
  static inline unsigned int PROFILE_MEDIATES_AF(struct aa_profile *profile,



Re: [PATCH v2 0/2] arm64: dts: g12a/g12b: add the Ethernet PHY GPIO IRQs

2019-06-19 Thread Kevin Hilman

Martin Blumenstingl  writes:

> Avoid polling of the PHY status by passing the Ethernet PHY's GPIO
> interrupt line to the PHY node.
>
> I tested this successfully on my X96 Max, but I don't have an Odroid-N2
> to test it there. The reset and interrupt GPIO part of the schematics
> seems to be identical for both boards (and probably other "reference
> design" based boards as well).
>
> This depends on my other series "Ethernet PHY reset GPIO updates for
> Amlogic SoCs" from [0]
>
>
> Changes since v1 at [1]:
> - added Neil's Tested/Acked-by (thank you!)
> - rebased on top of v3 of my other series
> - updated cover-letter since the GPIO interrupt controller support
>   is now merged so it's not a dependency anymore

Queued for v5.3,

Thanks,

Kevin

RE: [RFC net-next 1/5] net: stmmac: introduce IEEE 802.1Qbv configuration functionalities

2019-06-19 Thread Ong, Boon Leong

>-Original Message-
>From: Gomes, Vinicius
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dw_tsn_lib.c
>> @@ -0,0 +1,790 @@
>> +
>> +static struct tsn_hw_cap dw_tsn_hwcap;
>> +static bool dw_tsn_feat_en[TSN_FEAT_ID_MAX];
>> +static unsigned int dw_tsn_hwtunable[TSN_HWTUNA_MAX];
>> +static struct est_gc_config dw_est_gc_config;
>
>If it's at all possible to have more than one of these devices in a
>system, this should be moved to a per-device structure. That
>mac_device_info struct perhaps?
I do see value in scaling the code to more than one device there.
Thanks.

>> +void dwmac_tsn_init(void *ioaddr)
>
>Perhaps this should return an error if TSN is not supported. It may help
>simplify the initialization below.
Thanks for the input. It may not be apparent because this code does not
include Qbu detection yet. The thinking here is that the caller function
should not need to handle any IP configuration differences, e.g. SoC-1 may
have only Qbv while SoC-2 has both.

>
>> +{
>> +unsigned int hwid = TSN_RD32(ioaddr + GMAC4_VERSION) &
>TSN_VER_MASK;
>> +unsigned int hw_cap2 = TSN_RD32(ioaddr + GMAC_HW_FEATURE2);
>> +unsigned int hw_cap3 = TSN_RD32(ioaddr + GMAC_HW_FEATURE3);
>> +struct tsn_hw_cap *cap = &dw_tsn_hwcap;
>> +unsigned int gcl_depth;
>> +unsigned int tils_max;
>> +unsigned int ti_wid;
>> +
>> +memset(cap, 0, sizeof(*cap));
>> +
>> +if (hwid < TSN_CORE_VER) {
>> +TSN_WARN_NA("IP v5.00 does not support TSN\n");
Perhaps, we just print info here instead of warning because SoC with EQoS v5
can be built without Qbv. 

>> +return;
>> +}
>> +
>> +if (!(hw_cap3 & GMAC_HW_FEAT_ESTSEL)) {
>> +TSN_WARN_NA("EST NOT supported\n");
>> +cap->est_support = 0;
Same here. 

>> +
>> +return;
>> +}
>> +
>> +gcl_depth = est_get_gcl_depth(hw_cap3);
>> +ti_wid = est_get_ti_width(hw_cap3);
>> +
>> +cap->ti_wid = ti_wid;
>> +cap->gcl_depth = gcl_depth;
>> +
>> +tils_max = (hw_cap3 & GMAC_HW_FEAT_ESTSEL ? 3 : 0);
>> +tils_max = (1 << tils_max) - 1;
>> +cap->tils_max = tils_max;
>> +
>> +cap->ext_max = EST_TIWID_TO_EXTMAX(ti_wid);
>> +cap->txqcnt = ((hw_cap2 & GMAC_HW_FEAT_TXQCNT) >> 6) + 1;
>> +cap->est_support = 1;
>> +
>> +TSN_INFO("EST: depth=%u, ti_wid=%u, tils_max=%u tqcnt=%u\n",
>> + gcl_depth, ti_wid, tils_max, cap->txqcnt);
>> +}

>> diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h
>b/drivers/net/ethernet/stmicro/stmmac/hwif.h
>> index 2acfbc70e3c8..518a72805185 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
>> +++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
>> @@ -7,6 +7,7 @@
>>
>>  #include 
>>  #include 
>> +#include "dw_tsn_lib.h"
>>
>>  #define stmmac_do_void_callback(__priv, __module, __cname,  __arg0,
>__args...) \
>>  ({ \
>> @@ -311,6 +312,31 @@ struct stmmac_ops {
>>   bool loopback);
>>  void (*pcs_rane)(void __iomem *ioaddr, bool restart);
>>  void (*pcs_get_adv_lp)(void __iomem *ioaddr, struct rgmii_adv *adv);
>> +/* TSN functions */
>> +void (*tsn_init)(void __iomem *ioaddr);
>> +void (*get_tsn_hwcap)(struct tsn_hw_cap **tsn_hwcap);
>> +void (*set_est_gcb)(struct est_gc_entry *gcl,
>> +u32 bank);
>> +void (*set_tsn_feat)(enum tsn_feat_id featid, bool enable);
>> +int (*set_tsn_hwtunable)(void __iomem *ioaddr,
>> + enum tsn_hwtunable_id id,
>> + const unsigned int *data);
>> +int (*get_tsn_hwtunable)(enum tsn_hwtunable_id id,
>> + unsigned int *data);
>> +int (*get_est_bank)(void __iomem *ioaddr, u32 own);
>> +int (*set_est_gce)(void __iomem *ioaddr,
>> +   struct est_gc_entry *gce, u32 row,
>> +   u32 dbgb, u32 dbgm);
>> +int (*get_est_gcrr_llr)(void __iomem *ioaddr, u32 *gcl_len,
>> +u32 dbgb, u32 dbgm);
>> +int (*set_est_gcrr_llr)(void __iomem *ioaddr, u32 gcl_len,
>> +u32 dbgb, u32 dbgm);
>> +int (*set_est_gcrr_times)(void __iomem *ioaddr,
>> +  struct est_gcrr *gcrr,
>> +  u32 dbgb, u32 dbgm);
>> +int (*set_est_enable)(void __iomem *ioaddr, bool enable);
>> +int (*get_est_gcc)(void __iomem *ioaddr,
>> +   struct est_gc_config **gcc, bool frmdrv);
>
>These functions do not seem to be consistent with the rest of the
>stmmac_ops: most of the operations already there receive an
>mac_device_info as first argument, which seem much less error prone than
>a void* ioaddr.
Thanks for the input. We will look into this together with mac_device_info
and adjust accordingly.
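
The two review suggestions above (per-device state instead of file-scope statics, and an init function that reports an error) could look roughly like the following user-space sketch. All names here are hypothetical: `struct tsn_cap_sketch` is a trimmed-down stand-in for tsn_hw_cap, `struct mac_dev_sketch` stands in for mac_device_info, and the `est_present` and `gcl_depth` arguments replace values the real code decodes from the hardware feature registers.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, trimmed-down capability record (the patch has more fields). */
struct tsn_cap_sketch {
	bool est_support;
	unsigned int gcl_depth;
};

/* Hypothetical per-device container, standing in for mac_device_info. */
struct mac_dev_sketch {
	struct tsn_cap_sketch tsn_cap;
};

/*
 * Fill in the capabilities of one device. Returns 0 on success or
 * -EOPNOTSUPP when EST is not available, so the caller can skip the rest
 * of the TSN setup for that device.
 */
static int tsn_init_sketch(struct mac_dev_sketch *dev, bool est_present,
			   unsigned int gcl_depth)
{
	memset(&dev->tsn_cap, 0, sizeof(dev->tsn_cap));

	if (!est_present)
		return -EOPNOTSUPP;

	dev->tsn_cap.gcl_depth = gcl_depth;
	dev->tsn_cap.est_support = true;
	return 0;
}
```

Because the state lives in the structure passed in, two devices with different capabilities no longer overwrite each other's settings.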

RE: [EXT] Re: [v1 1/4] dt-bindings: display: Add DT bindings for LS1028A HDP-TX PHY.

2019-06-19 Thread Wen He



> -Original Message-
> From: Rob Herring 
> Sent: June 19, 2019 22:07
> To: Wen He 
> Cc: linux-kernel@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> devicet...@vger.kernel.org; shawn...@kernel.org; Leo Li
> 
> Subject: Re: [EXT] Re: [v1 1/4] dt-bindings: display: Add DT bindings for
> LS1028A HDP-TX PHY.
> 
> Caution: EXT Email
> 
> On Sun, Jun 16, 2019 at 7:45 PM Wen He  wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Rob Herring 
> > > Sent: June 14, 2019 4:08
> > > To: Wen He 
> > > Cc: linux-kernel@vger.kernel.org;
> > > linux-arm-ker...@lists.infradead.org;
> > > devicet...@vger.kernel.org; shawn...@kernel.org; Leo Li
> > > 
> > > Subject: [EXT] Re: [v1 1/4] dt-bindings: display: Add DT bindings
> > > for LS1028A HDP-TX PHY.
> > >
> > > Caution: EXT Email
> > >
> > > On Wed, May 08, 2019 at 10:35:25AM +, Wen He wrote:
> > > > Add DT bindings documentmation for the HDP-TX PHY controller. The
> > > > describes which could be found on NXP Layerscape ls1028a platform.
> > >
> > > Drop the hard stop (.) from the subject.
> > >
> > > >
> > > > Signed-off-by: Wen He 
> > > > ---
> > > >  .../devicetree/bindings/display/fsl,hdp.txt   | 56
> +++
> > > >  1 file changed, 56 insertions(+)
> > > >  create mode 100644
> > > > Documentation/devicetree/bindings/display/fsl,hdp.txt
> > > >
> > > > diff --git a/Documentation/devicetree/bindings/display/fsl,hdp.txt
> > > > b/Documentation/devicetree/bindings/display/fsl,hdp.txt
> > > > new file mode 100644
> > > > index ..36b5687a1261
> > > > --- /dev/null
> > > > +++ b/Documentation/devicetree/bindings/display/fsl,hdp.txt
> > > > @@ -0,0 +1,56 @@
> > > > +NXP Layerscpae ls1028a HDP-TX PHY Controller
> > > > +
> > > > +
> > > > +The following bindings describe the Cadence HDP TX PHY on ls1028a
> > > > +that offer multi-protocol support of standars such as eDP and
> > > > +Displayport, supports for 25-600MHz pixel clock and up to 4k2k at
> > > > +60MHz
> > > resolution.
> > > > +The HDP transmitter is a Cadence HDP TX controller IP with a
> > > > +companion PHY IP.
> > >
> > > I'm confused. This binding covers both blocks or is just one of them?
> > >
> >
> > Hi Rob,
> >
> > This binding covers both blocks(HDP TX PHY and HDP TX Controller),
> > Because they are belong to the one IP.
> 
> In that case, you should also have an output port to a DP connector node (or
> DP panel).

Hi Rob,

As I recall, the binding already includes a description of the DP connector node, as shown below.

---
Required sub-nodes:
- port

That should be right, shouldn't it?

Best Regards,
Wen

> 
> Rob

Re: [PATCH v3 0/4] Ethernet PHY reset GPIO updates for Amlogic SoCs

2019-06-19 Thread Kevin Hilman

Martin Blumenstingl  writes:

> While trying to add the Ethernet PHY interrupt on the X96 Max I found
> that the current reset line definition is incorrect. Patch #1 fixes
> this.
>
> Since the fix requires moving from the deprecated "snps,reset-gpio"
> property to the generic Ethernet PHY reset bindings I decided to move
> all Amlogic boards over to the non-deprecated bindings. That's what
> patches #2 and #3 do.
>
> Finally I found that Odroid-N2 doesn't define the Ethernet PHY's reset
> GPIO yet. I don't have that board so I can't test whether it really
> works but based on the schematics it should. 
>
> This series is a partial successor to "stmmac: honor the GPIO flags
> for the PHY reset GPIO" from [0]. I decided not to take Linus W.'s
> Reviewed-by from patch #4 of that series because I had to change the
> wording and I want to be sure that he's happy with that now.
>
> One quick note regarding patches #1 and #4: I decided to violate the
> "max 80 characters per line" (by 4 characters) limit because I find
> that the result is easier to read then it would be if I split the
> line.
>
>
> Changes since v1 at [1]:
> - fixed the reset deassert delay for RTL8211F PHYs - spotted by Robin
>   Murphy (thank you). according to the public RTL8211E datasheet the
>   correct values seem to be: 10ms assert, 30ms deassert
> - fixed the reset assert and deassert delays for IP101GR PHYs. There
>   are two values given in the public datasheet, use the higher one
>   (10ms instead of 2.5)
> - update the patch descriptions to quote the datasheets (the RTL8211F
>   quotes are taken from the public RTL8211E datasheet because as far
>   as I can tell the reset sequence is identical on both PHYs)
>
> Changes since v2 at [2]:
> - add Neil's Reviewed/Acked/Tested-by's (thank you!)
> - rebased on top of "arm64: dts: meson-g12a-x96-max: add sound card"
>
>
> [0] https://patchwork.kernel.org/cover/10983801/
> [1] https://patchwork.kernel.org/cover/10985155/
> [2] https://patchwork.kernel.org/cover/10990863/

Queued for v5.3...

> Martin Blumenstingl (4):
>   arm64: dts: meson: g12a: x96-max: fix the Ethernet PHY reset line
>   ARM: dts: meson: switch to the generic Ethernet PHY reset bindings

...in branch v5.3/dt

>   arm64: dts: meson: use the generic Ethernet PHY reset GPIO bindings
>   arm64: dts: meson: g12b: odroid-n2: add the Ethernet PHY reset line

The other 3 in v5.3/dt64,

Thanks,

Kevin

mmotm 2019-06-19-20-32 uploaded

2019-06-19 Thread akpm

The mm-of-the-moment snapshot 2019-06-19-20-32 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (5.x
or 5.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/



The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 5.2-rc5:
(patches marked "*" will be included in linux-next)

  origin.patch
* mm-dev_pfn-exclude-memory_device_private-while-computing-virtual-address.patch
* fs-proc-allow-reporting-eip-esp-for-all-coredumping-threads.patch
* mm-mempolicy-fix-an-incorrect-rebind-node-in-mpol_rebind_nodemask.patch
* binfmt_flat-make-load_flat_shared_library-work.patch
* signal-remove-the-wrong-signal_pending-check-in-restore_user_sigmask.patch
* mm-soft-offline-return-ebusy-if-set_hwpoison_free_buddy_page-fails.patch
* mm-hugetlb-soft-offline-dissolve_free_huge_page-return-zero-on-pagehuge.patch
* mm-hugetlb-soft-offline-dissolve_free_huge_page-return-zero-on-pagehuge-v3.patch
* mm-oom_kill-fix-uninitialized-oc-constraint.patch
* initramfs-fix-populate_initrd_image-section-mismatch.patch
* mm-idle-page-fix-oops-because-end_pfn-is-larger-than-max_pfn.patch
* mm-vmalloc-avoid-bogus-wmaybe-uninitialized-warning.patch
* mm-vmalloc-avoid-bogus-wmaybe-uninitialized-warning-fix.patch
* maintainers-add-clang-llvm-build-support-info.patch
* mm-vmscan-fix-not-scanning-anonymous-pages-when-detecting-file-refaults.patch
* forkmemcg-alloc_thread_stack_node-needs-to-set-tsk-stack.patch
* iommu-replace-single-char-identifiers-in-macros.patch
* lib-test_kasan-add-bitops-tests.patch
* x86-use-static_cpu_has-in-uaccess-region-to-avoid-instrumentation.patch
* asm-generic-x86-add-bitops-instrumentation-for-kasan.patch
* scripts-decode_stacktrace-match-basepath-using-shell-prefix-operator-not-regex.patch
* scripts-decode_stacktrace-look-for-modules-with-kodebug-extension.patch
* scripts-decode_stacktrace-look-for-modules-with-kodebug-extension-v2.patch
* scripts-spellingtxt-drop-sepc-from-the-misspelling-list.patch
* scripts-spellingtxt-drop-sepc-from-the-misspelling-list-fix.patch
* scripts-spellingtxt-add-spelling-fix-for-prohibited.patch
* scripts-decode_stacktrace-accept-dash-underscore-in-modules.patch
* scripts-spellingtxt-add-more-spellings-to-spellingtxt.patch
* sh-configs-remove-config_logfs-from-defconfig.patch
* sh-config-remove-left-over-backlight_lcd_support.patch
* debugobjects-move-printk-out-of-db-lock-critical-sections.patch
* fs-ocfs-fix-spelling-mistake-hearbeating-heartbeat.patch
* ocfs2-dlm-use-struct_size-helper.patch
* ocfs2-add-last-unlock-times-in-locking_state.patch
* ocfs2-add-locking-filter-debugfs-file.patch
* ocfs2-add-locking-filter-debugfs-file-fix.patch
* ocfs2-add-first-lock-wait-time-in-locking_state.patch
* ocfs-no-need-to-check-return-value-of-debugfs_create-functions.patch
* ocfs-no-need-to-check-return-value-of-debugfs_create-functions-v2.patch
* ocfs2-clear-zero-in-unaligned-direct-io.patch
* ocfs2-clear-zero-in-unaligned-direct-io-checkpatch-fixes.patch
* ocfs2-wait-for-recovering-done-after-direct-unlock-request.patch
* ocfs2-checkpoint-appending-truncate-log-transaction-before-flushing.patch
* ramfs-support-o_tmpfile.patch
  mm.patch
* mm-slab-validate-cache-membership-under-freelist-hardening.patch
* mm-slab-sanity-check-page-type-when-looking-up-cache.patch
* mm-slab-sanity-check-page-type-when-looking-up-cache-fix.patch
* lkdtm-heap-add-tests-for-freelist-hardening.patch
* mm-slub-avoid-double-string-traverse-in-kmem_cache_flags.patch
* slub-dont-panic-for-memcg-kmem-cache-creation-failure.patch
*

Re: linux-next: manual merge of the rdma tree with Linus' tree

2019-06-19 Thread Doug Ledford

On Thu, 2019-06-20 at 12:10 +1000, Stephen Rothwell wrote:
>   2d3c72ed5041 ("rdma: Remove nes")

Yeah, not much you can do about tree wide patchsets conflicting with a
removal ;-)

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



Re: linux-next: manual merge of the rdma tree with Linus' tree

2019-06-19 Thread Doug Ledford

On Thu, 2019-06-20 at 12:06 +1000, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the rdma tree got a conflict in:
> 
>   include/rdma/ib_verbs.h
> 
> between commit:
> 
>   dc1435c00fcd ("RDMA/srp: Rename SRP sysfs name after IB device
> rename trigger")
> 
> from Linus' tree and commit:
> 
>   0e2d00eb6fd4 ("RDMA: Add NLDEV_GET_CHARDEV to allow char dev
> discovery and autoload")
> 
> from the rdma tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 

Yep, this one was expected.  Thanks.

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



RE: [RFC net-next 1/5] net: stmmac: introduce IEEE 802.1Qbv configuration functionalities

2019-06-19 Thread Ong, Boon Leong

>> > It looks like most of the TSN_WARN should actually be netdev_dbg().
>> >
>> >Andrew
>>
>> Hi Andrew,
>> This file is targeted for dual licensing which is GPL-2.0 OR BSD-3-Clause.
>> This is the reason why we are using wrappers around the functions so that
>> all the function call is generic.
>
>I don't see why dual licenses should require wrappers. Please explain.
>
>  Thanks
>   Andrew

Agreed with Andrew. We can change those wrapper functions, which served
internal development needs for scaling across multiple OSes.
We will update the kernel code as suggested.

[PATCH] samples: make pidfd-metadata fail gracefully on older kernels

2019-06-19 Thread Dmitry V. Levin

Initialize pidfd to an invalid descriptor, to fail gracefully on
those kernels that do not implement CLONE_PIDFD and leave pidfd
unchanged.

Signed-off-by: Dmitry V. Levin 
---
 samples/pidfd/pidfd-metadata.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
index 14b454448429..ff109fdac3a5 100644
--- a/samples/pidfd/pidfd-metadata.c
+++ b/samples/pidfd/pidfd-metadata.c
@@ -83,7 +83,7 @@ static int pidfd_metadata_fd(pid_t pid, int pidfd)
 
 int main(int argc, char *argv[])
 {
-   int pidfd = 0, ret = EXIT_FAILURE;
+   int pidfd = -1, ret = EXIT_FAILURE;
char buf[4096] = { 0 };
pid_t pid;
int procfd, statusfd;
@@ -91,7 +91,11 @@ int main(int argc, char *argv[])
 
pid = pidfd_clone(CLONE_PIDFD, &pidfd);
if (pid < 0)
-   exit(ret);
+   err(ret, "CLONE_PIDFD");
+   if (pidfd < 0) {
+   warnx("CLONE_PIDFD is not supported by the kernel");
+   goto out;
+   }
 
procfd = pidfd_metadata_fd(pid, pidfd);
close(pidfd);
-- 
ldv
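
The idea behind this patch can be sketched in user space as follows. This is
only an illustration under stated assumptions, not the sample's code:
fake_clone() is a hypothetical stand-in for a clone wrapper, and its
kernel_supports_pidfd argument simulates whether the kernel writes to the
output argument at all.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a clone(2) wrapper. An older kernel ignores
 * the unknown flag and never writes to *pidfd; a newer one stores a
 * descriptor there. Both return a positive pid.
 */
static int fake_clone(bool kernel_supports_pidfd, int *pidfd)
{
	if (kernel_supports_pidfd)
		*pidfd = 3;	/* pretend descriptor number */
	return 1234;		/* pretend child pid */
}

/* Returns true only if the caller really received a pidfd. */
static bool got_pidfd(bool kernel_supports_pidfd)
{
	int pidfd = -1;	/* invalid sentinel, so "left unchanged" is detectable */
	int pid = fake_clone(kernel_supports_pidfd, &pidfd);

	if (pid < 0)
		return false;
	if (pidfd < 0) {
		fprintf(stderr, "CLONE_PIDFD is not supported by the kernel\n");
		return false;
	}
	return true;
}
```

With an initial value of 0, the "unchanged" case would be indistinguishable
from a valid descriptor (stdin), which is why the sentinel has to be negative.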

[PATCH V2 4/5] cpufreq: Reuse cpufreq_update_current_freq() in __cpufreq_get()

Their implementations are quite similar, so let's modify
cpufreq_update_current_freq() a little and use it from __cpufreq_get().

Also rename cpufreq_update_current_freq() to
cpufreq_verify_current_freq(), as that's what it is doing.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c | 70 ---
 1 file changed, 28 insertions(+), 42 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 5f5c7a516c74..4556a53fc764 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1547,6 +1547,30 @@ static void cpufreq_out_of_sync(struct cpufreq_policy *policy,
cpufreq_freq_transition_end(policy, &freqs, 0);
 }
 
+static unsigned int cpufreq_verify_current_freq(struct cpufreq_policy *policy, bool update)
+{
+   unsigned int new_freq;
+
+   new_freq = cpufreq_driver->get(policy->cpu);
+   if (!new_freq)
+   return 0;
+
+   /*
+* If fast frequency switching is used with the given policy, the check
+* against policy->cur is pointless, so skip it in that case.
+*/
+   if (policy->fast_switch_enabled || !has_target())
+   return new_freq;
+
+   if (policy->cur != new_freq) {
+   cpufreq_out_of_sync(policy, new_freq);
+   if (update)
+   schedule_work(&policy->update);
+   }
+
+   return new_freq;
+}
+
 /**
  * cpufreq_quick_get - get the CPU frequency (in kHz) from policy->cur
  * @cpu: CPU number
@@ -1602,30 +1626,10 @@ EXPORT_SYMBOL(cpufreq_quick_get_max);
 
 static unsigned int __cpufreq_get(struct cpufreq_policy *policy)
 {
-   unsigned int ret_freq = 0;
-
if (unlikely(policy_is_inactive(policy)))
-   return ret_freq;
-
-   ret_freq = cpufreq_driver->get(policy->cpu);
-
-   /*
-* If fast frequency switching is used with the given policy, the check
-* against policy->cur is pointless, so skip it in that case too.
-*/
-   if (policy->fast_switch_enabled)
-   return ret_freq;
-
-   if (has_target() && ret_freq && policy->cur) {
-   /* verify no discrepancy between actual and
-   saved value exists */
-   if (unlikely(ret_freq != policy->cur)) {
-   cpufreq_out_of_sync(policy, ret_freq);
-   schedule_work(&policy->update);
-   }
-   }
+   return 0;
 
-   return ret_freq;
+   return cpufreq_verify_current_freq(policy, true);
 }
 
 /**
@@ -1652,24 +1656,6 @@ unsigned int cpufreq_get(unsigned int cpu)
 }
 EXPORT_SYMBOL(cpufreq_get);
 
-static unsigned int cpufreq_update_current_freq(struct cpufreq_policy *policy)
-{
-   unsigned int new_freq;
-
-   new_freq = cpufreq_driver->get(policy->cpu);
-   if (!new_freq)
-   return 0;
-
-   if (!policy->cur) {
-   pr_debug("cpufreq: Driver did not initialize current freq\n");
-   policy->cur = new_freq;
-   } else if (policy->cur != new_freq && has_target()) {
-   cpufreq_out_of_sync(policy, new_freq);
-   }
-
-   return new_freq;
-}
-
 static struct subsys_interface cpufreq_interface = {
.name   = "cpufreq",
.subsys = &cpu_subsys,
@@ -2151,7 +2137,7 @@ static int cpufreq_start_governor(struct cpufreq_policy *policy)
pr_debug("%s: for CPU %u\n", __func__, policy->cpu);
 
if (cpufreq_driver->get)
-   cpufreq_update_current_freq(policy);
+   cpufreq_verify_current_freq(policy, false);
 
if (policy->governor->start) {
ret = policy->governor->start(policy);
@@ -2402,7 +2388,7 @@ void cpufreq_update_policy(unsigned int cpu)
 * -> ask driver for current freq and notify governors about a change
 */
if (cpufreq_driver->get && has_target() &&
-   (cpufreq_suspended || WARN_ON(!cpufreq_update_current_freq(policy))))
+   (cpufreq_suspended || WARN_ON(!cpufreq_verify_current_freq(policy, false))))
goto unlock;
 
pr_debug("updating policy for CPU %u\n", cpu);
-- 
2.21.0.rc0.269.g1a574e7a288b
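
The control flow of the new helper can be modelled outside the kernel roughly
as below. This is a sketch under assumptions: struct model_policy and every
field in it are illustrative stand-ins, the hw_freq field plays the role of
the driver's ->get() callback, and the two counters replace
cpufreq_out_of_sync() and schedule_work().

```c
#include <stdbool.h>

/* Simplified stand-in for kernel state; all names here are illustrative. */
struct model_policy {
	unsigned int cur;		/* frequency the core believes is set */
	bool fast_switch_enabled;
	bool has_target;		/* driver switches frequencies itself */
	unsigned int hw_freq;		/* what the driver's ->get() would report */
	int resyncs;			/* number of out-of-sync corrections */
	int updates_scheduled;		/* number of deferred policy updates */
};

static unsigned int model_verify_current_freq(struct model_policy *p,
					      bool update)
{
	unsigned int new_freq = p->hw_freq;

	if (!new_freq)
		return 0;

	/* In these two cases there is nothing meaningful to compare against. */
	if (p->fast_switch_enabled || !p->has_target)
		return new_freq;

	if (p->cur != new_freq) {
		p->cur = new_freq;	/* roughly the effect of cpufreq_out_of_sync() */
		p->resyncs++;
		if (update)
			p->updates_scheduled++;	/* stands in for schedule_work() */
	}

	return new_freq;
}
```

In this model, a caller that passes update == true gets the deferred policy
re-evaluation on a mismatch, while a caller that passes false only gets the
resynchronisation.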

[PATCH V2 5/5] cpufreq: Avoid calling cpufreq_verify_current_freq() from handle_update()

On some occasions cpufreq_verify_current_freq() schedules a work whose
callback is handle_update(), which further calls cpufreq_update_policy()
which may end up calling cpufreq_verify_current_freq() again.

On the other hand, when cpufreq_update_policy() is called from
handle_update(), the pointer to the cpufreq policy is already available
but we still call cpufreq_cpu_acquire() to get it in
cpufreq_update_policy(), which should be avoided as well.

Fix both the issues by creating another helper
reeval_frequency_limits().

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4556a53fc764..0a73de7aae54 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1115,13 +1115,25 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp
return ret;
 }
 
+static void reeval_frequency_limits(struct cpufreq_policy *policy)
+{
+   struct cpufreq_policy new_policy = *policy;
+
+   pr_debug("updating policy for CPU %u\n", policy->cpu);
+
+   new_policy.min = policy->user_policy.min;
+   new_policy.max = policy->user_policy.max;
+
+   cpufreq_set_policy(policy, &new_policy);
+}
+
 static void handle_update(struct work_struct *work)
 {
struct cpufreq_policy *policy =
container_of(work, struct cpufreq_policy, update);
-   unsigned int cpu = policy->cpu;
-   pr_debug("handle_update for cpu %u called\n", cpu);
-   cpufreq_update_policy(cpu);
+
+   pr_debug("handle_update for cpu %u called\n", policy->cpu);
+   reeval_frequency_limits(policy);
 }
 
 static struct cpufreq_policy *cpufreq_policy_alloc(unsigned int cpu)
@@ -2378,7 +2390,6 @@ int cpufreq_set_policy(struct cpufreq_policy *policy,
 void cpufreq_update_policy(unsigned int cpu)
 {
struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
-   struct cpufreq_policy new_policy;
 
if (!policy)
return;
@@ -2391,12 +2402,7 @@ void cpufreq_update_policy(unsigned int cpu)
(cpufreq_suspended || WARN_ON(!cpufreq_verify_current_freq(policy, false))))
goto unlock;
 
-   pr_debug("updating policy for CPU %u\n", cpu);
-   memcpy(&new_policy, policy, sizeof(*policy));
-   new_policy.min = policy->user_policy.min;
-   new_policy.max = policy->user_policy.max;
-
-   cpufreq_set_policy(policy, &new_policy);
+   reeval_frequency_limits(policy);
 
 unlock:
cpufreq_cpu_release(policy);
-- 
2.21.0.rc0.269.g1a574e7a288b

[PATCH V2 3/5] cpufreq: Use has_target() instead of !setpolicy

For code consistency, use has_target() instead of !setpolicy everywhere,
as is already done in several places. Maybe we should also use
"!has_target()" instead of "cpufreq_driver->setpolicy" where we need to
check if the driver supports setpolicy, so that only one expression is
used for this kind of differentiation.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 41ac701e324f..5f5c7a516c74 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -632,7 +632,7 @@ static int cpufreq_parse_policy(char *str_governor,
 }
 
 /**
- * cpufreq_parse_governor - parse a governor string only for !setpolicy
+ * cpufreq_parse_governor - parse a governor string only for has_target()
  */
 static int cpufreq_parse_governor(char *str_governor,
  struct cpufreq_policy *policy)
@@ -1301,7 +1301,7 @@ static int cpufreq_online(unsigned int cpu)
policy->max = policy->user_policy.max;
}
 
-   if (cpufreq_driver->get && !cpufreq_driver->setpolicy) {
+   if (cpufreq_driver->get && has_target()) {
policy->cur = cpufreq_driver->get(policy->cpu);
if (!policy->cur) {
pr_err("%s: ->get() failed\n", __func__);
@@ -2401,7 +2401,7 @@ void cpufreq_update_policy(unsigned int cpu)
 * BIOS might change freq behind our back
 * -> ask driver for current freq and notify governors about a change
 */
-   if (cpufreq_driver->get && !cpufreq_driver->setpolicy &&
+   if (cpufreq_driver->get && has_target() &&
(cpufreq_suspended || WARN_ON(!cpufreq_update_current_freq(policy))))
goto unlock;
 
-- 
2.21.0.rc0.269.g1a574e7a288b
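
Why the two expressions are interchangeable can be illustrated with a small
user-space model. It assumes, as a premise rather than something shown in this
patch, that a driver provides either a setpolicy callback or one of the target
callbacks, never both and never neither. struct model_driver and dummy_cb()
are hypothetical names used only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative subset of a driver's callbacks; not the kernel's struct. */
struct model_driver {
	int (*setpolicy)(void);
	int (*target)(void);
	int (*target_index)(void);
};

/* Models the idea behind has_target(): the driver exposes a target callback. */
static bool model_has_target(const struct model_driver *drv)
{
	return drv->target_index || drv->target;
}

/* Placeholder callback so that the function pointers are non-NULL. */
static int dummy_cb(void)
{
	return 0;
}
```

Under that premise, model_has_target(drv) and drv->setpolicy == NULL always
give the same answer, so choosing one spelling is purely a readability
decision.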

[PATCH V2 2/5] cpufreq: Replace few CPUFREQ_CONST_LOOPS checks with has_target()

CPUFREQ_CONST_LOOPS was introduced in a very old commit from a pre-2.6
kernel release, commit 6a4a93f9c0d5 ("[CPUFREQ] Fix 'out of sync'
issue").

Probably the initial idea was just to avoid these checks for set_policy
type drivers, and then things changed over the years. It is very
unclear why these checks are there at all.

Replace the CPUFREQ_CONST_LOOPS check with has_target(), which makes
more sense now.

cpufreq_notify_transition() is only called for has_target() type drivers
and not for set_policy type drivers, so the check there is simply
redundant. Remove it as well.

Also remove the () around the freq comparison statement, as they aren't
required and checkpatch warns about them.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 54befd775bd6..41ac701e324f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -359,12 +359,10 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
 * which is not equal to what the cpufreq core thinks is
 * "old frequency".
 */
-   if (!(cpufreq_driver->flags & CPUFREQ_CONST_LOOPS)) {
-   if (policy->cur && (policy->cur != freqs->old)) {
-   pr_debug("Warning: CPU frequency is %u, cpufreq assumed %u kHz\n",
-freqs->old, policy->cur);
-   freqs->old = policy->cur;
-   }
+   if (policy->cur && policy->cur != freqs->old) {
+   pr_debug("Warning: CPU frequency is %u, cpufreq assumed %u kHz\n",
+freqs->old, policy->cur);
+   freqs->old = policy->cur;
}
 
srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
@@ -1618,8 +1616,7 @@ static unsigned int __cpufreq_get(struct cpufreq_policy *policy)
if (policy->fast_switch_enabled)
return ret_freq;
 
-   if (ret_freq && policy->cur &&
-   !(cpufreq_driver->flags & CPUFREQ_CONST_LOOPS)) {
+   if (has_target() && ret_freq && policy->cur) {
/* verify no discrepancy between actual and
saved value exists */
if (unlikely(ret_freq != policy->cur)) {
-- 
2.21.0.rc0.269.g1a574e7a288b

[PATCH V2 0/5] cpufreq: cleanups

Hi Rafael,

I accumulated these while reworking the freq-constraint series and it
would be nice if these can get in before I send the next version of
freq-constraint stuff.

These are mostly cleanups and code consolidation for better management
of code. Compile and boot tested only.

Thanks.

V1->V2:
- Merged patch 2/6 and 3/6 (now called 2/5).
- Updated commit log of 3/5 as it wasn't clear enough earlier.

Viresh Kumar (5):
  cpufreq: Remove the redundant !setpolicy check
  cpufreq: Replace few CPUFREQ_CONST_LOOPS checks with has_target()
  cpufreq: Use has_target() instead of !setpolicy
  cpufreq: Reuse cpufreq_update_current_freq() in __cpufreq_get()
  cpufreq: Avoid calling cpufreq_verify_current_freq() from
handle_update()

 drivers/cpufreq/cpufreq.c | 115 +-
 1 file changed, 52 insertions(+), 63 deletions(-)

-- 
2.21.0.rc0.269.g1a574e7a288b

[PATCH V2 1/5] cpufreq: Remove the redundant !setpolicy check

cpufreq_start_governor() is only called in the !setpolicy case, so
checking it again is not required.

Signed-off-by: Viresh Kumar 
---
 drivers/cpufreq/cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 85ff958e01f1..54befd775bd6 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2153,7 +2153,7 @@ static int cpufreq_start_governor(struct cpufreq_policy *policy)
 
pr_debug("%s: for CPU %u\n", __func__, policy->cpu);
 
-   if (cpufreq_driver->get && !cpufreq_driver->setpolicy)
+   if (cpufreq_driver->get)
cpufreq_update_current_freq(policy);
 
if (policy->governor->start) {
-- 
2.21.0.rc0.269.g1a574e7a288b

Re: linux-next: manual merge of the mlx5-next tree with Linus' tree

Hi all,

On Mon, 17 Jun 2019 12:19:59 +1000 Stephen Rothwell wrote:
>
> Hi Leon,
> 
> Today's linux-next merge of the mlx5-next tree got a conflict in:
> 
>   include/linux/mlx5/eswitch.h
> 
> between commit:
> 
>   02f3afd97556 ("net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index")
> 
> from Linus' tree and commit:
> 
>   82b11f071936 ("net/mlx5: Expose eswitch encap mode")
> 
> from the mlx5-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc include/linux/mlx5/eswitch.h
> index e9a55c0d50fd,174eec0871d9..
> --- a/include/linux/mlx5/eswitch.h
> +++ b/include/linux/mlx5/eswitch.h
> @@@ -61,5 -62,16 +62,16 @@@ void *mlx5_eswitch_uplink_get_proto_dev
>   u8 mlx5_eswitch_mode(struct mlx5_eswitch *esw);
>   struct mlx5_flow_handle *
>   mlx5_eswitch_add_send_to_vport_rule(struct mlx5_eswitch *esw,
>  -int vport, u32 sqn);
>  +u16 vport_num, u32 sqn);
> + 
> + #ifdef CONFIG_MLX5_ESWITCH
> + enum devlink_eswitch_encap_mode
> + mlx5_eswitch_get_encap_mode(const struct mlx5_core_dev *dev);
> + #else  /* CONFIG_MLX5_ESWITCH */
> + static inline enum devlink_eswitch_encap_mode
> + mlx5_eswitch_get_encap_mode(const struct mlx5_core_dev *dev)
> + {
> + return DEVLINK_ESWITCH_ENCAP_MODE_NONE;
> + }
> + #endif /* CONFIG_MLX5_ESWITCH */
>   #endif

This is now a conflict between Linus' tree and the rdma tree.

-- 
Cheers,
Stephen Rothwell



Re: [PATCH v2] MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info

2019-06-19 Thread Nathan Chancellor

On Wed, Jun 19, 2019 at 05:19:07PM -0700, 'Nick Desaulniers' via Clang Built Linux wrote:
> Add keyword support so that our mailing list gets cc'ed for clang/llvm
> patches. We're pretty active on our mailing list so far as code review.
> There are numerous Googlers like myself that are paid to support
> building the Linux kernel with Clang and LLVM.
> 
> Signed-off-by: Nick Desaulniers 

FWIW, if it is not too late:

Reviewed-by: Nathan Chancellor 

> ---
> Changes V1 -> V2:
> - tabs vs spaces as per Joe Perches
> 
>  MAINTAINERS | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ef58d9a881ee..f92432452f46 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3940,6 +3940,14 @@ M: Miguel Ojeda 
>  S:   Maintained
>  F:   .clang-format
>  
> +CLANG/LLVM BUILD SUPPORT
> +L:   clang-built-li...@googlegroups.com
> +W:   https://clangbuiltlinux.github.io/
> +B:   https://github.com/ClangBuiltLinux/linux/issues
> +C:   irc://chat.freenode.net/clangbuiltlinux
> +S:   Supported
> +K:   \b(?i:clang|llvm)\b
> +
>  CLEANCACHE API
>  M:   Konrad Rzeszutek Wilk 
>  L:   linux-kernel@vger.kernel.org
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
>
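
For readers unfamiliar with the K: entry in the quoted patch, the pattern
\b(?i:clang|llvm)\b matches "clang" or "llvm" as whole words, ignoring case.
The sketch below imitates that behaviour by hand in C. It is not how the
kernel's tooling evaluates the pattern; has_word() and matches_clang_keyword()
are hypothetical helper names, and strncasecmp() is assumed to be available
from the POSIX <strings.h> header.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Word characters in the regex sense: letters, digits and underscore. */
static bool is_word_char(char c)
{
	return isalnum((unsigned char)c) || c == '_';
}

/* True if text contains word as a whole word, ignoring case. */
static bool has_word(const char *text, const char *word)
{
	size_t n = strlen(word);

	for (const char *p = text; *p; p++) {
		if (strncasecmp(p, word, n) != 0)
			continue;
		/* A boundary is the string edge or a non-word character. */
		bool left_ok = (p == text) || !is_word_char(p[-1]);
		bool right_ok = !is_word_char(p[n]);

		if (left_ok && right_ok)
			return true;
	}
	return false;
}

static bool matches_clang_keyword(const char *text)
{
	return has_word(text, "clang") || has_word(text, "llvm");
}
```

In this model "clang-format" matches, because the hyphen is not a word
character, while "clangd" does not.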

[PATCH v6 10/14] soc: mediatek: Add multiple step bus protection control

Both MT8183 and MT6765 have more bus protection control steps than
previous projects, and they add more bus protection registers, which
reside in infracfg and smi-common. Add new APIs for multiple step bus
protection control with more customized arguments.

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/Makefile   |  2 +-
 drivers/soc/mediatek/mtk-scpsys-ext.c   | 99 +
 drivers/soc/mediatek/mtk-scpsys.c   | 39 +
 include/linux/soc/mediatek/scpsys-ext.h | 39 +
 4 files changed, 168 insertions(+), 11 deletions(-)
 create mode 100644 drivers/soc/mediatek/mtk-scpsys-ext.c
 create mode 100644 include/linux/soc/mediatek/scpsys-ext.h

diff --git a/drivers/soc/mediatek/Makefile b/drivers/soc/mediatek/Makefile
index 64ce5ee..b9dbad6 100644
--- a/drivers/soc/mediatek/Makefile
+++ b/drivers/soc/mediatek/Makefile
@@ -1,4 +1,4 @@
 obj-$(CONFIG_MTK_CMDQ) += mtk-cmdq-helper.o
-obj-$(CONFIG_MTK_INFRACFG) += mtk-infracfg.o
+obj-$(CONFIG_MTK_INFRACFG) += mtk-infracfg.o mtk-scpsys-ext.o
 obj-$(CONFIG_MTK_PMIC_WRAP) += mtk-pmic-wrap.o
 obj-$(CONFIG_MTK_SCPSYS) += mtk-scpsys.o
diff --git a/drivers/soc/mediatek/mtk-scpsys-ext.c b/drivers/soc/mediatek/mtk-scpsys-ext.c
new file mode 100644
index 000..b24321e
--- /dev/null
+++ b/drivers/soc/mediatek/mtk-scpsys-ext.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 MediaTek Inc.
+ * Author: Owen Chen 
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MTK_POLL_DELAY_US   10
+#define MTK_POLL_TIMEOUT    USEC_PER_SEC
+
+static int set_bus_protection(struct regmap *map, u32 mask, u32 ack_mask,
+   u32 reg_set, u32 reg_sta, u32 reg_en)
+{
+   u32 val;
+
+   if (reg_set)
+   regmap_write(map, reg_set, mask);
+   else
+   regmap_update_bits(map, reg_en, mask, mask);
+
+   return regmap_read_poll_timeout(map, reg_sta,
+   val, (val & ack_mask) == ack_mask,
+   MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
+}
+
+static int clear_bus_protection(struct regmap *map, u32 mask, u32 ack_mask,
+   u32 reg_clr, u32 reg_sta, u32 reg_en)
+{
+   u32 val;
+
+   if (reg_clr)
+   regmap_write(map, reg_clr, mask);
+   else
+   regmap_update_bits(map, reg_en, mask, 0);
+
+   return regmap_read_poll_timeout(map, reg_sta,
+   val, !(val & ack_mask),
+   MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
+}
+
+int mtk_scpsys_ext_set_bus_protection(const struct bus_prot *bp_table,
+   struct regmap *infracfg, struct regmap *smi_common)
+{
+   int i;
+
+   for (i = 0; i < MAX_STEPS; i++) {
+   struct regmap *map;
+   int ret;
+
+   if (bp_table[i].type == INVALID_TYPE)
+   continue;
+   else if (bp_table[i].type == IFR_TYPE)
+   map = infracfg;
+   else if (bp_table[i].type == SMI_TYPE)
+   map = smi_common;
+
+   ret = set_bus_protection(map,
+   bp_table[i].mask, bp_table[i].mask,
+   bp_table[i].set_ofs, bp_table[i].sta_ofs,
+   bp_table[i].en_ofs);
+
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
+int mtk_scpsys_ext_clear_bus_protection(const struct bus_prot *bp_table,
+   struct regmap *infracfg, struct regmap *smi_common)
+{
+   int i;
+
+   for (i = MAX_STEPS - 1; i >= 0; i--) {
+   struct regmap *map;
+   int ret;
+
+   if (bp_table[i].type == INVALID_TYPE)
+   continue;
+   else if (bp_table[i].type == IFR_TYPE)
+   map = infracfg;
+   else if (bp_table[i].type == SMI_TYPE)
+   map = smi_common;
+
+   ret = clear_bus_protection(map,
+   bp_table[i].mask, bp_table[i].clr_ack_mask,
+   bp_table[i].clr_ofs, bp_table[i].sta_ofs,
+   bp_table[i].en_ofs);
+
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
diff --git a/drivers/soc/mediatek/mtk-scpsys.c b/drivers/soc/mediatek/mtk-scpsys.c
index 4a0752e..10c2440 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -120,6 +121,7 @@ enum clk_id {
  * @basic_clk_id: provide the same purpose with field "clk_id"
  *by declaring basic clock prefix name rather than clk_id.
  * @caps: The flag for active wake-up action.
+ * @bp_table: The mask table for multiple step bus protection.
  */
 struct scp_domain_data {
const char *name;
@@ -131,6 +133,7 @@ struct scp_domain_data {
enum clk_id
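
The ordering used by the two table walkers in this patch (apply the steps
front to back, release them back to front, skip unused slots) can be sketched
in plain C as below. This is a model under assumptions, not the driver code:
the enum, struct step, N_STEPS and both functions are hypothetical names, and
each step is reduced to an integer id that is merely recorded.

```c
#include <stddef.h>

enum step_type { STEP_INVALID = 0, STEP_IFR, STEP_SMI };

#define N_STEPS 4

struct step {
	enum step_type type;
	int id;
};

/* Record the ids in the order protection would be set; returns the count. */
static size_t enable_order(const struct step *tbl, int *order)
{
	size_t n = 0;

	for (int i = 0; i < N_STEPS; i++) {
		if (tbl[i].type == STEP_INVALID)
			continue;	/* unused slot in the table */
		order[n++] = tbl[i].id;
	}
	return n;
}

/* Record the ids in the order protection would be cleared (reverse). */
static size_t disable_order(const struct step *tbl, int *order)
{
	size_t n = 0;

	for (int i = N_STEPS - 1; i >= 0; i--) {
		if (tbl[i].type == STEP_INVALID)
			continue;
		order[n++] = tbl[i].id;
	}
	return n;
}
```

Releasing in the opposite order to setup is a common convention for layered
resources; whether the hardware strictly requires it is not stated in the
patch.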

[PATCH v6 08/14] soc: mediatek: Refactor bus protection control

Put bus protection enable and disable control in separate functions.

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 44 ++-
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c b/drivers/soc/mediatek/mtk-scpsys.c
index 58627ab..178198b 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -274,6 +274,30 @@ static int scpsys_sram_disable(struct scp_domain *scpd, void __iomem *ctl_addr)
MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
 }
 
+static int scpsys_bus_protect_enable(struct scp_domain *scpd)
+{
+   struct scp *scp = scpd->scp;
+
+   if (!scpd->data->bus_prot_mask)
+   return 0;
+
+   return mtk_infracfg_set_bus_protection(scp->infracfg,
+   scpd->data->bus_prot_mask,
+   scp->bus_prot_reg_update);
+}
+
+static int scpsys_bus_protect_disable(struct scp_domain *scpd)
+{
+   struct scp *scp = scpd->scp;
+
+   if (!scpd->data->bus_prot_mask)
+   return 0;
+
+   return mtk_infracfg_clear_bus_protection(scp->infracfg,
+   scpd->data->bus_prot_mask,
+   scp->bus_prot_reg_update);
+}
+
 static int scpsys_power_on(struct generic_pm_domain *genpd)
 {
struct scp_domain *scpd = container_of(genpd, struct scp_domain, genpd);
@@ -316,13 +340,9 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
if (ret < 0)
goto err_pwr_ack;
 
-   if (scpd->data->bus_prot_mask) {
-   ret = mtk_infracfg_clear_bus_protection(scp->infracfg,
-   scpd->data->bus_prot_mask,
-   scp->bus_prot_reg_update);
-   if (ret)
-   goto err_pwr_ack;
-   }
+   ret = scpsys_bus_protect_disable(scpd);
+   if (ret < 0)
+   goto err_pwr_ack;
 
return 0;
 
@@ -344,13 +364,9 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
u32 val;
int ret, tmp;
 
-   if (scpd->data->bus_prot_mask) {
-   ret = mtk_infracfg_set_bus_protection(scp->infracfg,
-   scpd->data->bus_prot_mask,
-   scp->bus_prot_reg_update);
-   if (ret)
-   goto out;
-   }
+   ret = scpsys_bus_protect_enable(scpd);
+   if (ret < 0)
+   goto out;
 
ret = scpsys_sram_disable(scpd, ctl_addr);
if (ret < 0)
-- 
1.8.1.1.dirty

[PATCH v6 09/14] soc: mediatek: Add basic_clk_id to scp_power_data

Stop extending clk_id and clk_names each time new BASIC clocks are
added. Instead, let each power domain get its own clocks through its
basic_clk_id.

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c b/drivers/soc/mediatek/mtk-scpsys.c
index 178198b..4a0752e 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -117,6 +117,8 @@ enum clk_id {
  * @sram_pdn_ack_bits: The mask for sram power control acked bits.
  * @bus_prot_mask: The mask for single step bus protection.
  * @clk_id: The basic clocks required by this power domain.
+ * @basic_clk_id: provide the same purpose with field "clk_id"
+ *by declaring basic clock prefix name rather than clk_id.
  * @caps: The flag for active wake-up action.
  */
 struct scp_domain_data {
@@ -127,6 +129,7 @@ struct scp_domain_data {
u32 sram_pdn_ack_bits;
u32 bus_prot_mask;
enum clk_id clk_id[MAX_CLKS];
+   const char *basic_clk_id[MAX_CLKS];
u8 caps;
 };
 
@@ -490,16 +493,26 @@ static struct scp *init_scp(struct platform_device *pdev,
 
scpd->data = data;
 
-   for (j = 0; j < MAX_CLKS && data->clk_id[j]; j++) {
-   struct clk *c = clk[data->clk_id[j]];
+   if (data->clk_id[0]) {
+   WARN_ON(data->basic_clk_id[0]);
 
-   if (IS_ERR(c)) {
-   dev_err(&pdev->dev, "%s: clk unavailable\n",
-   data->name);
-   return ERR_CAST(c);
-   }
+   for (j = 0; j < MAX_CLKS && data->clk_id[j]; j++) {
+   struct clk *c = clk[data->clk_id[j]];
+
+   if (IS_ERR(c)) {
+   dev_err(&pdev->dev,
+   "%s: clk unavailable\n",
+   data->name);
+   return ERR_CAST(c);
+   }
 
-   scpd->clk[j] = c;
+   scpd->clk[j] = c;
+   }
+   } else if (data->basic_clk_id[0]) {
+   for (j = 0; j < MAX_CLKS &&
+   data->basic_clk_id[j]; j++)
+   scpd->clk[j] = devm_clk_get(&pdev->dev,
+   data->basic_clk_id[j]);
}
 
genpd->name = data->name;
-- 
1.8.1.1.dirty

[PATCH v6 05/14] soc: mediatek: Refactor regulator control

Put regulator enable and disable control in separate functions.

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 32 +++-
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c b/drivers/soc/mediatek/mtk-scpsys.c
index f775b1b..1a6a4ab 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -191,6 +191,22 @@ static int scpsys_domain_is_on(struct scp_domain *scpd)
return -EINVAL;
 }
 
+static int scpsys_regulator_enable(struct scp_domain *scpd)
+{
+   if (!scpd->supply)
+   return 0;
+
+   return regulator_enable(scpd->supply);
+}
+
+static int scpsys_regulator_disable(struct scp_domain *scpd)
+{
+   if (!scpd->supply)
+   return 0;
+
+   return regulator_disable(scpd->supply);
+}
+
 static int scpsys_power_on(struct generic_pm_domain *genpd)
 {
struct scp_domain *scpd = container_of(genpd, struct scp_domain, genpd);
@@ -201,11 +217,9 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
int ret, tmp;
int i;
 
-   if (scpd->supply) {
-   ret = regulator_enable(scpd->supply);
-   if (ret)
-   return ret;
-   }
+   ret = scpsys_regulator_enable(scpd);
+   if (ret < 0)
+   return ret;
 
for (i = 0; i < MAX_CLKS && scpd->clk[i]; i++) {
ret = clk_prepare_enable(scpd->clk[i]);
@@ -273,8 +287,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
clk_disable_unprepare(scpd->clk[i]);
}
 err_clk:
-   if (scpd->supply)
-   regulator_disable(scpd->supply);
+   scpsys_regulator_disable(scpd);
 
dev_err(scp->dev, "Failed to power on domain %s\n", genpd->name);
 
@@ -333,8 +346,9 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
for (i = 0; i < MAX_CLKS && scpd->clk[i]; i++)
clk_disable_unprepare(scpd->clk[i]);
 
-   if (scpd->supply)
-   regulator_disable(scpd->supply);
+   ret = scpsys_regulator_disable(scpd);
+   if (ret < 0)
+   goto out;
 
return 0;
 
-- 
1.8.1.1.dirty

[PATCH v6 07/14] soc: mediatek: Refactor sram control

Put sram enable and disable control in separate functions.

Signed-off-by: Weiyi Lu 
Reviewed-by: Nicolas Boichat 
---
 drivers/soc/mediatek/mtk-scpsys.c | 79 +--
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c b/drivers/soc/mediatek/mtk-scpsys.c
index 5b73e4e..58627ab 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -230,12 +230,55 @@ static int scpsys_clk_enable(struct clk *clk[], int max_num)
return ret;
 }
 
+static int scpsys_sram_enable(struct scp_domain *scpd, void __iomem *ctl_addr)
+{
+   u32 val;
+   u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
+   int tmp;
+
+   val = readl(ctl_addr) & ~scpd->data->sram_pdn_bits;
+   writel(val, ctl_addr);
+
+   /* Either wait until SRAM_PDN_ACK all 0 or have a force wait */
+   if (MTK_SCPD_CAPS(scpd, MTK_SCPD_FWAIT_SRAM)) {
+   /*
+* Currently, MTK_SCPD_FWAIT_SRAM is necessary only for
+* MT7622_POWER_DOMAIN_WB and thus just a trivial setup
+* is applied here.
+*/
+   usleep_range(12000, 12100);
+   } else {
+   /* Either wait until SRAM_PDN_ACK all 1 or 0 */
+   int ret = readl_poll_timeout(ctl_addr, tmp,
+   (tmp & pdn_ack) == 0,
+   MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
+   if (ret < 0)
+   return ret;
+   }
+
+   return 0;
+}
+
+static int scpsys_sram_disable(struct scp_domain *scpd, void __iomem *ctl_addr)
+{
+   u32 val;
+   u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
+   int tmp;
+
+   val = readl(ctl_addr) | scpd->data->sram_pdn_bits;
+   writel(val, ctl_addr);
+
+   /* Either wait until SRAM_PDN_ACK all 1 or 0 */
+   return readl_poll_timeout(ctl_addr, tmp,
+   (tmp & pdn_ack) == pdn_ack,
+   MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
+}
+
 static int scpsys_power_on(struct generic_pm_domain *genpd)
 {
struct scp_domain *scpd = container_of(genpd, struct scp_domain, genpd);
struct scp *scp = scpd->scp;
void __iomem *ctl_addr = scp->base + scpd->data->ctl_offs;
-   u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
u32 val;
int ret, tmp;
 
@@ -247,6 +290,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
if (ret)
goto err_clk;
 
+   /* subsys power on */
val = readl(ctl_addr);
val |= PWR_ON_BIT;
writel(val, ctl_addr);
@@ -268,24 +312,9 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
val |= PWR_RST_B_BIT;
writel(val, ctl_addr);
 
-   val &= ~scpd->data->sram_pdn_bits;
-   writel(val, ctl_addr);
-
-   /* Either wait until SRAM_PDN_ACK all 0 or have a force wait */
-   if (MTK_SCPD_CAPS(scpd, MTK_SCPD_FWAIT_SRAM)) {
-   /*
-* Currently, MTK_SCPD_FWAIT_SRAM is necessary only for
-* MT7622_POWER_DOMAIN_WB and thus just a trivial setup is
-* applied here.
-*/
-   usleep_range(12000, 12100);
-
-   } else {
-   ret = readl_poll_timeout(ctl_addr, tmp, (tmp & pdn_ack) == 0,
-MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
-   if (ret < 0)
-   goto err_pwr_ack;
-   }
+   ret = scpsys_sram_enable(scpd, ctl_addr);
+   if (ret < 0)
+   goto err_pwr_ack;
 
if (scpd->data->bus_prot_mask) {
ret = mtk_infracfg_clear_bus_protection(scp->infracfg,
@@ -312,7 +341,6 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
struct scp_domain *scpd = container_of(genpd, struct scp_domain, genpd);
struct scp *scp = scpd->scp;
void __iomem *ctl_addr = scp->base + scpd->data->ctl_offs;
-   u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
u32 val;
int ret, tmp;
 
@@ -324,17 +352,12 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
goto out;
}
 
-   val = readl(ctl_addr);
-   val |= scpd->data->sram_pdn_bits;
-   writel(val, ctl_addr);
-
-   /* wait until SRAM_PDN_ACK all 1 */
-   ret = readl_poll_timeout(ctl_addr, tmp, (tmp & pdn_ack) == pdn_ack,
-MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT);
+   ret = scpsys_sram_disable(scpd, ctl_addr);
if (ret < 0)
goto out;
 
-   val |= PWR_ISO_BIT;
+   /* subsys power off */
+   val = readl(ctl_addr) | PWR_ISO_BIT;
writel(val, ctl_addr);
 
val &= ~PWR_RST_B_BIT;
-- 
1.8.1.1.dirty

[PATCH v6 11/14] soc: mediatek: Add subsys clock control for bus protection

Add the subsys CG control flow before/after the bus protection control,
because bus protection needs the SMI bus related CGs to be enabled in
order to feed back its ack.

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 72 +--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index 10c2440..74fd981 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -108,6 +108,7 @@ enum clk_id {
 };
 
 #define MAX_CLKS   3
+#define MAX_SUBSYS_CLKS 10
 
 /**
  * struct scp_domain_data - scp domain data for power on/off flow
@@ -120,6 +121,8 @@ enum clk_id {
  * @clk_id: The basic clocks required by this power domain.
  * @basic_clk_id: provide the same purpose with field "clk_id"
  *by declaring basic clock prefix name rather than clk_id.
+ * @subsys_clk_prefix: The prefix name of the clocks need to be enabled
+ * before releasing bus protection.
  * @caps: The flag for active wake-up action.
  * @bp_table: The mask table for multiple step bus protection.
  */
@@ -132,6 +135,7 @@ struct scp_domain_data {
u32 bus_prot_mask;
enum clk_id clk_id[MAX_CLKS];
const char *basic_clk_id[MAX_CLKS];
+   const char *subsys_clk_prefix;
u8 caps;
struct bus_prot bp_table[MAX_STEPS];
 };
@@ -142,6 +146,7 @@ struct scp_domain {
struct generic_pm_domain genpd;
struct scp *scp;
struct clk *clk[MAX_CLKS];
+   struct clk *subsys_clk[MAX_SUBSYS_CLKS];
const struct scp_domain_data *data;
struct regulator *supply;
 };
@@ -347,16 +352,22 @@ static int scpsys_power_on(struct generic_pm_domain 
*genpd)
val |= PWR_RST_B_BIT;
writel(val, ctl_addr);
 
-   ret = scpsys_sram_enable(scpd, ctl_addr);
+   ret = scpsys_clk_enable(scpd->subsys_clk, MAX_SUBSYS_CLKS);
if (ret < 0)
goto err_pwr_ack;
 
+   ret = scpsys_sram_enable(scpd, ctl_addr);
+   if (ret < 0)
+   goto err_sram;
+
ret = scpsys_bus_protect_disable(scpd);
if (ret < 0)
-   goto err_pwr_ack;
+   goto err_sram;
 
return 0;
 
+err_sram:
+   scpsys_clk_disable(scpd->subsys_clk, MAX_SUBSYS_CLKS);
 err_pwr_ack:
scpsys_clk_disable(scpd->clk, MAX_CLKS);
 err_clk:
@@ -383,6 +394,8 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
if (ret < 0)
goto out;
 
+   scpsys_clk_disable(scpd->subsys_clk, MAX_SUBSYS_CLKS);
+
/* subsys power off */
val = readl(ctl_addr) | PWR_ISO_BIT;
writel(val, ctl_addr);
@@ -419,6 +432,48 @@ static int scpsys_power_off(struct generic_pm_domain 
*genpd)
return ret;
 }
 
+static int init_subsys_clks(struct platform_device *pdev,
+   const char *prefix, struct clk **clk)
+{
+   struct device_node *node = pdev->dev.of_node;
+   u32 prefix_len, sub_clk_cnt = 0;
+   struct property *prop;
+   const char *clk_name;
+
+   if (!node) {
+   dev_err(&pdev->dev, "Cannot find scpsys node: %ld\n",
+   PTR_ERR(node));
+   return PTR_ERR(node);
+   }
+
+   prefix_len = strlen(prefix);
+
+   of_property_for_each_string(node, "clock-names", prop, clk_name) {
+   if (!strncmp(clk_name, prefix, prefix_len) &&
+   (clk_name[prefix_len] == '-')) {
+   if (sub_clk_cnt >= MAX_SUBSYS_CLKS) {
+   dev_err(&pdev->dev,
+   "subsys clk out of range %d\n",
+   sub_clk_cnt);
+   return -ENOMEM;
+   }
+
+   clk[sub_clk_cnt] = devm_clk_get(&pdev->dev,
+   clk_name);
+
+   if (IS_ERR(clk[sub_clk_cnt])) {
+   dev_err(&pdev->dev,
+   "Subsys clk read fail %ld\n",
+   PTR_ERR(clk[sub_clk_cnt]));
+   return PTR_ERR(clk[sub_clk_cnt]);
+   }
+   sub_clk_cnt++;
+   }
+   }
+
+   return sub_clk_cnt;
+}
+
 static void init_clks(struct platform_device *pdev, struct clk **clk)
 {
int i;
@@ -506,6 +561,7 @@ static struct scp *init_scp(struct platform_device *pdev,
struct scp_domain *scpd = &scp->domains[i];
struct generic_pm_domain *genpd = &scpd->genpd;
const struct scp_domain_data *data = &scp_domain_data[i];
+   int clk_cnt;
 
pd_data->domains[i] = genpd;
scpd->scp = scp;
@@ -534,6 +590,18 @@ static struct scp *init_scp(struct platform_device *pdev,
data->basic_clk_id[j]);
}
 
+   if

[PATCH v6 14/14] arm64: dts: Add power controller device node of MT8183

Add the power controller node and the smi-common node for MT8183.
The scpsys node contains the clocks and the regmaps of
infracfg and smi-common used for bus protection.

Signed-off-by: Weiyi Lu 
---
 arch/arm64/boot/dts/mediatek/mt8183.dtsi | 62 
 1 file changed, 62 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/mt8183.dtsi 
b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
index 08274bf..75c4881 100644
--- a/arch/arm64/boot/dts/mediatek/mt8183.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 / {
compatible = "mediatek,mt8183";
@@ -196,6 +197,62 @@
#clock-cells = <1>;
};
 
+   scpsys: syscon@10006000 {
+   compatible = "mediatek,mt8183-scpsys", "syscon";
+   #power-domain-cells = <1>;
+   reg = <0 0x10006000 0 0x1000>;
+   clocks = < CLK_TOP_MUX_AUD_INTBUS>,
+< CLK_INFRA_AUDIO>,
+< CLK_INFRA_AUDIO_26M_BCLK>,
+< CLK_TOP_MUX_MFG>,
+< CLK_TOP_MUX_MM>,
+< CLK_TOP_MUX_CAM>,
+< CLK_TOP_MUX_IMG>,
+< CLK_TOP_MUX_IPU_IF>,
+< CLK_TOP_MUX_DSP>,
+< CLK_TOP_MUX_DSP1>,
+< CLK_TOP_MUX_DSP2>,
+< CLK_MM_SMI_COMMON>,
+< CLK_MM_SMI_LARB0>,
+< CLK_MM_SMI_LARB1>,
+< CLK_MM_GALS_COMM0>,
+< CLK_MM_GALS_COMM1>,
+< CLK_MM_GALS_CCU2MM>,
+< CLK_MM_GALS_IPU12MM>,
+< CLK_MM_GALS_IMG2MM>,
+< CLK_MM_GALS_CAM2MM>,
+< CLK_MM_GALS_IPU2MM>,
+< CLK_IMG_LARB5>,
+< CLK_IMG_LARB2>,
+< CLK_CAM_LARB6>,
+< CLK_CAM_LARB3>,
+< CLK_CAM_SENINF>,
+< CLK_CAM_CAMSV0>,
+< CLK_CAM_CAMSV1>,
+< CLK_CAM_CAMSV2>,
+< CLK_CAM_CCU>,
+<_conn CLK_IPU_CONN_IPU>,
+<_conn CLK_IPU_CONN_AHB>,
+<_conn CLK_IPU_CONN_AXI>,
+<_conn CLK_IPU_CONN_ISP>,
+<_conn CLK_IPU_CONN_CAM_ADL>,
+<_conn CLK_IPU_CONN_IMG_ADL>;
+   clock-names = "audio", "audio1", "audio2",
+ "mfg", "mm", "cam",
+ "isp", "vpu", "vpu1",
+ "vpu2", "vpu3", "mm-0",
+ "mm-1", "mm-2", "mm-3",
+ "mm-4", "mm-5", "mm-6",
+ "mm-7", "mm-8", "mm-9",
+ "isp-0", "isp-1", "cam-0",
+ "cam-1", "cam-2", "cam-3",
+ "cam-4", "cam-5", "cam-6",
+ "vpu-0", "vpu-1", "vpu-2",
+ "vpu-3", "vpu-4", "vpu-5";
+   infracfg = <>;
+   smi_comm = <&smi_common>;
+   };
+
apmixedsys: syscon@1000c000 {
compatible = "mediatek,mt8183-apmixedsys", "syscon";
reg = <0 0x1000c000 0 0x1000>;
@@ -260,6 +317,11 @@
#clock-cells = <1>;
};
 
+   smi_common: smi@14019000 {
+   compatible = "mediatek,mt8183-smi-common", "syscon";
+   reg = <0 0x14019000 0 0x1000>;
+   };
+
imgsys: syscon@1502 {
compatible = "mediatek,mt8183-imgsys", "syscon";
reg = <0 0x1502 0 0x1000>;
-- 
1.8.1.1.dirty

[PATCH v6 02/14] dt-bindings: soc: Add MT8183 power dt-bindings

Add the power dt-bindings for MT8183 and introduce the "BASIC" and
"SUBSYS" clock types in the binding document.
The "BASIC" type is compatible with the original power control and
uses clock names of the form [a-z]+[0-9]*, e.g. mm, vpu1.
The "SUBSYS" type is used for bus protection control and uses clock
names of the form [a-z]+-[0-9]+, e.g. isp-0, cam-1.
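
The naming rule above can be sketched in plain userspace C. This is only
an illustration, not the driver code: is_subsys_clk() and
count_subsys_clks() are hypothetical helpers, and the check is stricter
than the driver in this series, which only compares the prefix and the
following '-'.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if name has the SUBSYS form
 * "<prefix>-<digits>", e.g. "mm-0" for prefix "mm".
 */
static bool is_subsys_clk(const char *name, const char *prefix)
{
	size_t len = strlen(prefix);
	const char *p;

	/* The name must start with the prefix, followed by '-'. */
	if (strncmp(name, prefix, len) != 0 || name[len] != '-')
		return false;

	/* At least one character after '-', and all of them digits. */
	p = name + len + 1;
	if (*p == '\0')
		return false;
	for (; *p; p++)
		if (!isdigit((unsigned char)*p))
			return false;

	return true;
}

/* Counts the SUBSYS clocks of one prefix in a clock-names style list. */
static int count_subsys_clks(const char *const names[], int n,
			     const char *prefix)
{
	int i, cnt = 0;

	for (i = 0; i < n; i++)
		if (is_subsys_clk(names[i], prefix))
			cnt++;

	return cnt;
}
```

With a list such as "mm", "mm-0", "mm-1", "cam", "cam-0", "vpu1", the
prefix "mm" matches only "mm-0" and "mm-1"; "mm" and "vpu1" stay BASIC.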

Signed-off-by: Weiyi Lu 
---
 .../devicetree/bindings/soc/mediatek/scpsys.txt| 14 
 include/dt-bindings/power/mt8183-power.h   | 26 ++
 2 files changed, 40 insertions(+)
 create mode 100644 include/dt-bindings/power/mt8183-power.h

diff --git a/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt 
b/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt
index 876693a..00eab7e 100644
--- a/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt
+++ b/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt
@@ -14,6 +14,7 @@ power/power_domain.txt. It provides the power domains defined 
in
 - include/dt-bindings/power/mt2701-power.h
 - include/dt-bindings/power/mt2712-power.h
 - include/dt-bindings/power/mt7622-power.h
+- include/dt-bindings/power/mt8183-power.h
 
 Required properties:
 - compatible: Should be one of:
@@ -25,18 +26,31 @@ Required properties:
- "mediatek,mt7623a-scpsys": For MT7623A SoC
- "mediatek,mt7629-scpsys", "mediatek,mt7622-scpsys": For MT7629 SoC
- "mediatek,mt8173-scpsys"
+   - "mediatek,mt8183-scpsys"
 - #power-domain-cells: Must be 1
 - reg: Address range of the SCPSYS unit
 - infracfg: must contain a phandle to the infracfg controller
 - clock, clock-names: clocks according to the common clock binding.
   These are clocks which hardware needs to be
   enabled before enabling certain power domains.
+  The new clock type "BASIC" belongs to the type above.
+  The new clock type "SUBSYS" needs to be
+  enabled before releasing bus protection.
Required clocks for MT2701 or MT7623: "mm", "mfg", "ethif"
Required clocks for MT2712: "mm", "mfg", "venc", "jpgdec", "audio", 
"vdec"
Required clocks for MT6797: "mm", "mfg", "vdec"
Required clocks for MT7622 or MT7629: "hif_sel"
Required clocks for MT7623A: "ethif"
Required clocks for MT8173: "mm", "mfg", "venc", "venc_lt"
+   Required clocks for MT8183: BASIC: "audio", "mfg", "mm", "cam", "isp",
+  "vpu", "vpu1", "vpu2", "vpu3"
+   SUBSYS: "mm-0", "mm-1", "mm-2", "mm-3",
+   "mm-4", "mm-5", "mm-6", "mm-7",
+   "mm-8", "mm-9", "isp-0", "isp-1",
+   "cam-0", "cam-1", "cam-2", "cam-3",
+   "cam-4", "cam-5", "cam-6", "vpu-0",
+   "vpu-1", "vpu-2", "vpu-3", "vpu-4",
+   "vpu-5"
 
 Optional properties:
 - vdec-supply: Power supply for the vdec power domain
diff --git a/include/dt-bindings/power/mt8183-power.h 
b/include/dt-bindings/power/mt8183-power.h
new file mode 100644
index 000..5c0c8c7
--- /dev/null
+++ b/include/dt-bindings/power/mt8183-power.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2018 MediaTek Inc.
+ * Author: Weiyi Lu 
+ */
+
+#ifndef _DT_BINDINGS_POWER_MT8183_POWER_H
+#define _DT_BINDINGS_POWER_MT8183_POWER_H
+
+#define MT8183_POWER_DOMAIN_AUDIO  0
+#define MT8183_POWER_DOMAIN_CONN   1
+#define MT8183_POWER_DOMAIN_MFG_ASYNC  2
+#define MT8183_POWER_DOMAIN_MFG3
+#define MT8183_POWER_DOMAIN_MFG_CORE0  4
+#define MT8183_POWER_DOMAIN_MFG_CORE1  5
+#define MT8183_POWER_DOMAIN_MFG_2D 6
+#define MT8183_POWER_DOMAIN_DISP   7
+#define MT8183_POWER_DOMAIN_CAM8
+#define MT8183_POWER_DOMAIN_ISP9
+#define MT8183_POWER_DOMAIN_VDEC   10
+#define MT8183_POWER_DOMAIN_VENC   11
+#define MT8183_POWER_DOMAIN_VPU_TOP12
+#define MT8183_POWER_DOMAIN_VPU_CORE0  13
+#define MT8183_POWER_DOMAIN_VPU_CORE1  14
+
+#endif /* _DT_BINDINGS_POWER_MT8183_POWER_H */
-- 
1.8.1.1.dirty

[PATCH v6 13/14] soc: mediatek: Add MT8183 scpsys support

Add scpsys driver for MT8183

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 226 ++
 1 file changed, 226 insertions(+)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index d3fdb3f..ea5a221 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define MTK_POLL_DELAY_US   10
 #define MTK_POLL_TIMEOUTUSEC_PER_SEC
@@ -1129,6 +1130,217 @@ static void mtk_register_power_domains(struct 
platform_device *pdev,
{MT8173_POWER_DOMAIN_MFG_2D, MT8173_POWER_DOMAIN_MFG},
 };
 
+/*
+ * MT8183 power domain support
+ */
+
+static const struct scp_domain_data scp_domain_data_mt8183[] = {
+   [MT8183_POWER_DOMAIN_AUDIO] = {
+   .name = "audio",
+   .sta_mask = PWR_STATUS_AUDIO,
+   .ctl_offs = 0x0314,
+   .sram_pdn_bits = GENMASK(11, 8),
+   .sram_pdn_ack_bits = GENMASK(15, 12),
+   .basic_clk_id = {"audio", "audio1", "audio2"},
+   },
+   [MT8183_POWER_DOMAIN_CONN] = {
+   .name = "conn",
+   .sta_mask = PWR_STATUS_CONN,
+   .ctl_offs = 0x032c,
+   .sram_pdn_bits = 0,
+   .sram_pdn_ack_bits = 0,
+   .bp_table = {
+   BUS_PROT(IFR_TYPE, 0x2a0, 0x2a4, 0, 0x228,
+   BIT(13) | BIT(14), BIT(13) | BIT(14)),
+   },
+   },
+   [MT8183_POWER_DOMAIN_MFG_ASYNC] = {
+   .name = "mfg_async",
+   .sta_mask = PWR_STATUS_MFG_ASYNC,
+   .ctl_offs = 0x0334,
+   .sram_pdn_bits = 0,
+   .sram_pdn_ack_bits = 0,
+   .basic_clk_id = {"mfg"},
+   },
+   [MT8183_POWER_DOMAIN_MFG] = {
+   .name = "mfg",
+   .sta_mask = PWR_STATUS_MFG,
+   .ctl_offs = 0x0338,
+   .sram_pdn_bits = GENMASK(8, 8),
+   .sram_pdn_ack_bits = GENMASK(12, 12),
+   },
+   [MT8183_POWER_DOMAIN_MFG_CORE0] = {
+   .name = "mfg_core0",
+   .sta_mask = BIT(7),
+   .ctl_offs = 0x034c,
+   .sram_pdn_bits = GENMASK(8, 8),
+   .sram_pdn_ack_bits = GENMASK(12, 12),
+   },
+   [MT8183_POWER_DOMAIN_MFG_CORE1] = {
+   .name = "mfg_core1",
+   .sta_mask = BIT(20),
+   .ctl_offs = 0x0310,
+   .sram_pdn_bits = GENMASK(8, 8),
+   .sram_pdn_ack_bits = GENMASK(12, 12),
+   },
+   [MT8183_POWER_DOMAIN_MFG_2D] = {
+   .name = "mfg_2d",
+   .sta_mask = PWR_STATUS_MFG_2D,
+   .ctl_offs = 0x0348,
+   .sram_pdn_bits = GENMASK(8, 8),
+   .sram_pdn_ack_bits = GENMASK(12, 12),
+   .bp_table = {
+   BUS_PROT(IFR_TYPE, 0x2a8, 0x2ac, 0, 0x258,
+   BIT(19) | BIT(20) | BIT(21),
+   BIT(19) | BIT(20) | BIT(21)),
+   BUS_PROT(IFR_TYPE, 0x2a0, 0x2a4, 0, 0x228,
+   BIT(21) | BIT(22), BIT(21) | BIT(22)),
+   },
+   },
+   [MT8183_POWER_DOMAIN_DISP] = {
+   .name = "disp",
+   .sta_mask = PWR_STATUS_DISP,
+   .ctl_offs = 0x030c,
+   .sram_pdn_bits = GENMASK(8, 8),
+   .sram_pdn_ack_bits = GENMASK(12, 12),
+   .basic_clk_id = {"mm"},
+   .subsys_clk_prefix = "mm",
+   .bp_table = {
+   BUS_PROT(IFR_TYPE, 0x2a8, 0x2ac, 0, 0x258,
+   BIT(16) | BIT(17), BIT(16) | BIT(17)),
+   BUS_PROT(IFR_TYPE, 0x2a0, 0x2a4, 0, 0x228,
+   BIT(10) | BIT(11), BIT(10) | BIT(11)),
+   BUS_PROT(SMI_TYPE, 0x3c4, 0x3c8, 0, 0x3c0,
+   GENMASK(7, 0), GENMASK(7, 0)),
+   },
+   },
+   [MT8183_POWER_DOMAIN_CAM] = {
+   .name = "cam",
+   .sta_mask = BIT(25),
+   .ctl_offs = 0x0344,
+   .sram_pdn_bits = GENMASK(9, 8),
+   .sram_pdn_ack_bits = GENMASK(13, 12),
+   .basic_clk_id = {"cam"},
+   .subsys_clk_prefix = "cam",
+   .bp_table = {
+   BUS_PROT(IFR_TYPE, 0x2d4, 0x2d8, 0, 0x2ec,
+   BIT(4) | BIT(5) | BIT(9) | BIT(13),
+   BIT(4) | BIT(5) | BIT(9) | BIT(13)),
+   BUS_PROT(IFR_TYPE, 0x2a0, 0x2a4, 0, 0x228,
+   BIT(28), BIT(28)),
+   BUS_PROT(IFR_TYPE, 0x2d4, 0x2d8, 0, 0x2ec,
+   BIT(11), 0),
+   BUS_PROT(SMI_TYPE, 0x3c4, 0x3c8, 0, 0x3c0,
+   BIT(3) | BIT(4), BIT(3) |

[PATCH v6 01/14] dt-bindings: mediatek: Add property to mt8183 smi-common

Add the "syscon" compatible so that the scpsys driver can use the
regmap based syscon driver API.

Signed-off-by: Weiyi Lu 
---
 .../devicetree/bindings/memory-controllers/mediatek,smi-common.txt  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
index b478ade..01744ec 100644
--- 
a/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
+++ 
b/Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
@@ -20,7 +20,7 @@ Required properties:
"mediatek,mt2712-smi-common"
"mediatek,mt7623-smi-common", "mediatek,mt2701-smi-common"
"mediatek,mt8173-smi-common"
-   "mediatek,mt8183-smi-common"
+   "mediatek,mt8183-smi-common", "syscon"
 - reg : the register and size of the SMI block.
 - power-domains : a phandle to the power domain of this local arbiter.
 - clocks : Must contain an entry for each entry in clock-names.
-- 
1.8.1.1.dirty

[PATCH v6 03/14] soc: mediatek: Switch to SPDX license identifier

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index 5b24bb4..9f52f50 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -1,15 +1,7 @@
-/*
- * Copyright (c) 2015 Pengutronix, Sascha Hauer 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- */
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (c) 2015 Pengutronix, Sascha Hauer 
+
 #include 
 #include 
 #include 
-- 
1.8.1.1.dirty

[PATCH v6 06/14] soc: mediatek: Refactor clock control

Put clock enable and disable control in separate functions.
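
The enable-with-rollback pattern can be tried outside the kernel with a
small mock. This is a sketch under assumptions: struct mock_clk and the
mock_clk_*() helpers are hypothetical stand-ins for the clk API, and the
mock disable tolerates a NULL clock so the reverse loop can walk the
whole array.

```c
#include <stddef.h>

#define MAX_CLKS	3

/* Hypothetical stand-in for struct clk. */
struct mock_clk {
	int enabled;	/* enable count */
	int fail;	/* non-zero makes mock_clk_enable() fail */
};

static int mock_clk_enable(struct mock_clk *clk)
{
	if (clk->fail)
		return -1;
	clk->enabled++;
	return 0;
}

/* NULL is accepted and ignored, so unused array slots are harmless. */
static void mock_clk_disable(struct mock_clk *clk)
{
	if (clk)
		clk->enabled--;
}

/* Disable the first max_num clocks in reverse order. */
static void clks_disable(struct mock_clk *clk[], int max_num)
{
	int i;

	for (i = max_num - 1; i >= 0; i--)
		mock_clk_disable(clk[i]);
}

/* Enable clocks in order; on failure undo the ones already enabled. */
static int clks_enable(struct mock_clk *clk[], int max_num)
{
	int i, ret = 0;

	for (i = 0; i < max_num && clk[i]; i++) {
		ret = mock_clk_enable(clk[i]);
		if (ret) {
			clks_disable(clk, i);
			break;
		}
	}

	return ret;
}
```

If the third clock fails to enable, the first two are disabled again
before the error is returned, so the caller never has to unwind a
partially enabled set.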

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 45 ---
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index 1a6a4ab..5b73e4e 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -207,6 +207,29 @@ static int scpsys_regulator_disable(struct scp_domain 
*scpd)
return regulator_disable(scpd->supply);
 }
 
+static void scpsys_clk_disable(struct clk *clk[], int max_num)
+{
+   int i;
+
+   for (i = max_num - 1; i >= 0; i--)
+   clk_disable_unprepare(clk[i]);
+}
+
+static int scpsys_clk_enable(struct clk *clk[], int max_num)
+{
+   int i, ret = 0;
+
+   for (i = 0; i < max_num && clk[i]; i++) {
+   ret = clk_prepare_enable(clk[i]);
+   if (ret) {
+   scpsys_clk_disable(clk, i);
+   break;
+   }
+   }
+
+   return ret;
+}
+
 static int scpsys_power_on(struct generic_pm_domain *genpd)
 {
struct scp_domain *scpd = container_of(genpd, struct scp_domain, genpd);
@@ -215,21 +238,14 @@ static int scpsys_power_on(struct generic_pm_domain 
*genpd)
u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
u32 val;
int ret, tmp;
-   int i;
 
ret = scpsys_regulator_enable(scpd);
if (ret < 0)
return ret;
 
-   for (i = 0; i < MAX_CLKS && scpd->clk[i]; i++) {
-   ret = clk_prepare_enable(scpd->clk[i]);
-   if (ret) {
-   for (--i; i >= 0; i--)
-   clk_disable_unprepare(scpd->clk[i]);
-
-   goto err_clk;
-   }
-   }
+   ret = scpsys_clk_enable(scpd->clk, MAX_CLKS);
+   if (ret)
+   goto err_clk;
 
val = readl(ctl_addr);
val |= PWR_ON_BIT;
@@ -282,10 +298,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd)
return 0;
 
 err_pwr_ack:
-   for (i = MAX_CLKS - 1; i >= 0; i--) {
-   if (scpd->clk[i])
-   clk_disable_unprepare(scpd->clk[i]);
-   }
+   scpsys_clk_disable(scpd->clk, MAX_CLKS);
 err_clk:
scpsys_regulator_disable(scpd);
 
@@ -302,7 +315,6 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
u32 val;
int ret, tmp;
-   int i;
 
if (scpd->data->bus_prot_mask) {
ret = mtk_infracfg_set_bus_protection(scp->infracfg,
@@ -343,8 +355,7 @@ static int scpsys_power_off(struct generic_pm_domain *genpd)
if (ret < 0)
goto out;
 
-   for (i = 0; i < MAX_CLKS && scpd->clk[i]; i++)
-   clk_disable_unprepare(scpd->clk[i]);
+   scpsys_clk_disable(scpd->clk, MAX_CLKS);
 
ret = scpsys_regulator_disable(scpd);
if (ret < 0)
-- 
1.8.1.1.dirty

[PATCH v6 00/14] Mediatek MT8183 scpsys support

This series is based on v5.2-rc1 with MT8183 dts v11 patch[1] and
MT8183 SMI dt-binding v7 patch[2].

[1] https://patchwork.kernel.org/patch/10962375/
[2] https://patchwork.kernel.org/patch/10984743/

changes since v5:
- fix documentation in [PATCH 04/14]
- remove useless variable checking and reuse API of clock control in [PATCH 
06/14]
- coding style fix of bus protection control in [PATCH 08/14]
- fix naming of new added data in [PATCH 09/14]
- small refactor of multiple step bus protection control in [PATCH 10/14]

changes since v4:
- add property to mt8183 smi-common
- separate refactor patches and newly added functions
- add power controller device node

--

Weiyi Lu (14):
  dt-bindings: mediatek: Add property to mt8183 smi-common
  dt-bindings: soc: Add MT8183 power dt-bindings
  soc: mediatek: Switch to SPDX license identifier
  soc: mediatek: Refactor polling timeout and documentation
  soc: mediatek: Refactor regulator control
  soc: mediatek: Refactor clock control
  soc: mediatek: Refactor sram control
  soc: mediatek: Refactor bus protection control
  soc: mediatek: Add basic_clk_id to scp_power_data
  soc: mediatek: Add multiple step bus protection control
  soc: mediatek: Add subsys clock control for bus protection
  soc: mediatek: Add extra sram control
  soc: mediatek: Add MT8183 scpsys support
  arm64: dts: Add power controller device node of MT8183

 .../memory-controllers/mediatek,smi-common.txt |   2 +-
 .../devicetree/bindings/soc/mediatek/scpsys.txt|  14 +
 arch/arm64/boot/dts/mediatek/mt8183.dtsi   |  62 +++
 drivers/soc/mediatek/Makefile  |   2 +-
 drivers/soc/mediatek/mtk-scpsys-ext.c  |  99 
 drivers/soc/mediatek/mtk-scpsys.c  | 591 ++---
 include/dt-bindings/power/mt8183-power.h   |  26 +
 include/linux/soc/mediatek/scpsys-ext.h|  39 ++
 8 files changed, 745 insertions(+), 90 deletions(-)
 create mode 100644 drivers/soc/mediatek/mtk-scpsys-ext.c
 create mode 100644 include/dt-bindings/power/mt8183-power.h
 create mode 100644 include/linux/soc/mediatek/scpsys-ext.h

--

[PATCH v6 12/14] soc: mediatek: Add extra sram control

Some power domains, such as vpu_core on MT8183, need clock and
internal isolation of their sram while powering the sram on/off.
Add a flag "sram_iso_ctrl" in scp_domain_data to indicate whether
the extra sram isolation control is needed.
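
The bit sequence can be illustrated on a plain integer instead of an
MMIO register. This is a sketch only: sram_iso_release() and
sram_iso_apply() are hypothetical names, the BIT() macro is redefined
locally, and the 1us delay done by the driver between the two steps is
reduced to a comment.

```c
#include <stdint.h>

#define BIT(n)			(1u << (n))
#define PWR_SRAM_CLKISO_BIT	BIT(5)
#define PWR_SRAM_ISOINT_B_BIT	BIT(6)

/* Power-on side: release internal isolation, then clock isolation. */
static uint32_t sram_iso_release(uint32_t reg)
{
	reg |= PWR_SRAM_ISOINT_B_BIT;	/* ISOINT_B is active low */
	/* the driver waits about 1us here */
	reg &= ~PWR_SRAM_CLKISO_BIT;

	return reg;
}

/* Power-off side: the same two bits in the opposite order. */
static uint32_t sram_iso_apply(uint32_t reg)
{
	reg |= PWR_SRAM_CLKISO_BIT;
	reg &= ~PWR_SRAM_ISOINT_B_BIT;
	/* the driver waits about 1us here */

	return reg;
}
```

Applying sram_iso_apply() to the result of sram_iso_release() restores
the two isolation bits to their powered-off state and leaves all other
bits of the value untouched.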

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index 74fd981..d3fdb3f 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -57,6 +57,8 @@
 #define PWR_ON_BIT BIT(2)
 #define PWR_ON_2ND_BIT BIT(3)
 #define PWR_CLK_DIS_BITBIT(4)
+#define PWR_SRAM_CLKISO_BITBIT(5)
+#define PWR_SRAM_ISOINT_B_BIT  BIT(6)
 
 #define PWR_STATUS_CONNBIT(1)
 #define PWR_STATUS_DISPBIT(3)
@@ -115,6 +117,8 @@ enum clk_id {
  * @name: The domain name.
  * @sta_mask: The mask for power on/off status bit.
  * @ctl_offs: The offset for main power control register.
+ * @sram_iso_ctrl: The flag to judge if the power domain need to do
+ * the extra sram isolation control.
  * @sram_pdn_bits: The mask for sram power control bits.
  * @sram_pdn_ack_bits: The mask for sram power control acked bits.
  * @bus_prot_mask: The mask for single step bus protection.
@@ -130,6 +134,7 @@ struct scp_domain_data {
const char *name;
u32 sta_mask;
int ctl_offs;
+   bool sram_iso_ctrl;
u32 sram_pdn_bits;
u32 sram_pdn_ack_bits;
u32 bus_prot_mask;
@@ -268,6 +273,14 @@ static int scpsys_sram_enable(struct scp_domain *scpd, 
void __iomem *ctl_addr)
return ret;
}
 
+   if (scpd->data->sram_iso_ctrl)  {
+   val = readl(ctl_addr) | PWR_SRAM_ISOINT_B_BIT;
+   writel(val, ctl_addr);
+   udelay(1);
+   val &= ~PWR_SRAM_CLKISO_BIT;
+   writel(val, ctl_addr);
+   }
+
return 0;
 }
 
@@ -277,6 +290,15 @@ static int scpsys_sram_disable(struct scp_domain *scpd, 
void __iomem *ctl_addr)
u32 pdn_ack = scpd->data->sram_pdn_ack_bits;
int tmp;
 
+   if (scpd->data->sram_iso_ctrl)  {
+   val = readl(ctl_addr);
+   val |= PWR_SRAM_CLKISO_BIT;
+   writel(val, ctl_addr);
+   val &= ~PWR_SRAM_ISOINT_B_BIT;
+   writel(val, ctl_addr);
+   udelay(1);
+   }
+
val = readl(ctl_addr) | scpd->data->sram_pdn_bits;
writel(val, ctl_addr);
 
-- 
1.8.1.1.dirty

[PATCH v6 04/14] soc: mediatek: Refactor polling timeout and documentation

Use USEC_PER_SEC to indicate the polling timeout directly.
Also add documentation for scp_domain_data.
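
To show what a 10us poll delay with a one second timeout means, here is
a userspace sketch of a poll loop on simulated time. It is not the
kernel's readl_poll_timeout(); poll_timeout(), cond_fn and the two
example conditions are hypothetical, and no real sleeping is done.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define POLL_DELAY_US	10
#define POLL_TIMEOUT_US	1000000		/* one second, i.e. USEC_PER_SEC */

/* Hypothetical condition callback; elapsed_us is simulated time. */
typedef bool (*cond_fn)(uint64_t elapsed_us);

/*
 * Check the condition, give up once the timeout has elapsed, otherwise
 * advance the simulated clock by one poll delay and try again.
 */
static int poll_timeout(cond_fn cond, uint64_t delay_us, uint64_t timeout_us)
{
	uint64_t elapsed = 0;

	for (;;) {
		if (cond(elapsed))
			return 0;
		if (elapsed >= timeout_us)
			return -ETIMEDOUT;
		elapsed += delay_us;	/* stands in for a real sleep */
	}
}

/* Example conditions: one becomes true after 50us, one never does. */
static bool ready_after_50us(uint64_t t)
{
	return t >= 50;
}

static bool never_ready(uint64_t t)
{
	(void)t;
	return false;
}
```

With these values a condition that never becomes true is checked about
100000 times before the loop reports -ETIMEDOUT.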

Signed-off-by: Weiyi Lu 
---
 drivers/soc/mediatek/mtk-scpsys.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/mediatek/mtk-scpsys.c 
b/drivers/soc/mediatek/mtk-scpsys.c
index 9f52f50..f775b1b 100644
--- a/drivers/soc/mediatek/mtk-scpsys.c
+++ b/drivers/soc/mediatek/mtk-scpsys.c
@@ -21,7 +21,7 @@
 #include 
 
 #define MTK_POLL_DELAY_US   10
-#define MTK_POLL_TIMEOUT(jiffies_to_usecs(HZ))
+#define MTK_POLL_TIMEOUTUSEC_PER_SEC
 
 #define MTK_SCPD_ACTIVE_WAKEUP BIT(0)
 #define MTK_SCPD_FWAIT_SRAMBIT(1)
@@ -108,6 +108,17 @@ enum clk_id {
 
 #define MAX_CLKS   3
 
+/**
+ * struct scp_domain_data - scp domain data for power on/off flow
+ * @name: The domain name.
+ * @sta_mask: The mask for power on/off status bit.
+ * @ctl_offs: The offset for main power control register.
+ * @sram_pdn_bits: The mask for sram power control bits.
+ * @sram_pdn_ack_bits: The mask for sram power control acked bits.
+ * @bus_prot_mask: The mask for single step bus protection.
+ * @clk_id: The basic clocks required by this power domain.
+ * @caps: The flag for active wake-up action.
+ */
 struct scp_domain_data {
const char *name;
u32 sta_mask;
-- 
1.8.1.1.dirty

Re: [PATCH 1/2] i2c: aspeed: allow to customize base clock divisor

2019-06-19 Thread Tao Ren

On 6/19/19 4:02 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-06-19 at 22:32 +, Tao Ren wrote:
>> Thank you for the quick response, Brendan.
>>
>> Aspeed I2C bus frequency is defined by 3 parameters
>> (base_clk_divisor, clk_high_width, clk_low_width), and I choose
>> base_clk_divisor because it controls all the Aspeed I2C timings (such
>> as setup time and hold time). Once base_clk_divisor is decided
>> (either by the current logic in i2c-aspeed driver or manually set in
>> device tree), clk_high_width and clk_low_width will be calculated by
>> i2c-aspeed driver to meet the specified I2C bus speed.
>>
>> For example, by setting I2C bus frequency to 100KHz on AST2500
>> platform, (base_clock_divisor, clk_high_width, clk_low_width) is set
>> to (3, 15, 14) by our driver. But some slave devices (on CMM i2c-8
>> and Minipack i2c-0) NACK byte transactions with the default timing
>> setting: the issue can be resolved by setting base_clk_divisor to 4,
>> and (clk_high_width, clk_low_width) will be set to (7, 7) by our i2c-
>> aspeed driver to achieve similar I2C bus speed.
>>
>> Not sure if my answer helps to address your concerns, but kindly let
>> me know if you have further questions/suggestions.
> 
> Did you look at the resulting output on a scope ? I'm curious what
> might be wrong 
> 
> CCing Ryan from Aspeed, he might have some idea.
> 
> Could it be that with some specific dividers you have more jitter ?
> Still, i2c devices tend to be rather robust vs crappy clocks unless you
> are massively out of bounds, which makes me wonder whether something
> else might be wrong in your setup.
> 
> Cheers,
> Ben.

I've reached out to the hardware team to see if they can provide more input
(such as protocol decoder output), but so far I don't have such data. I
suspect it's caused by I2C timing, mainly because:

1) the intermittent i2c transaction failures always happen on the slave
devices which are furthest away from the ASPEED chip.

2) the i2c-aspeed driver in my kernel 4.1 tree (derived from the ASPEED SDK)
works properly, and when I copied the I2CD04 (Clock and AC Timing Control)
register value from kernel 4.1 and applied it to the latest upstream driver,
the transaction failure was fixed :)

Thank you Ben for looking into the issue and involving more experts (Ryan) in
the discussion. I have been suffering from the problem for several months and
I'm looking forward to a proper solution.


Cheers,

Tao

[PATCH] staging: rtl8723bs: hal: hal_btcoex: Remove variables pHalData and pU1Tmp

2019-06-19 Thread Hariprasad Kelam

Remove the pHalData variable as it is set but unused in the function.
Remove pU1Tmp and replace it with pu8.

Signed-off-by: Hariprasad Kelam 
---
 drivers/staging/rtl8723bs/hal/hal_btcoex.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/rtl8723bs/hal/hal_btcoex.c 
b/drivers/staging/rtl8723bs/hal/hal_btcoex.c
index fd0be52..e673319 100644
--- a/drivers/staging/rtl8723bs/hal/hal_btcoex.c
+++ b/drivers/staging/rtl8723bs/hal/hal_btcoex.c
@@ -560,18 +560,14 @@ static u8 halbtcoutsrc_Set(void *pBtcContext, u8 setType, 
void *pInBuf)
 {
PBTC_COEXIST pBtCoexist;
struct adapter *padapter;
-   struct hal_com_data *pHalData;
u8 *pu8;
-   u8 *pU1Tmp;
u32 *pU4Tmp;
u8 ret;
 
 
pBtCoexist = (PBTC_COEXIST)pBtcContext;
padapter = pBtCoexist->Adapter;
-   pHalData = GET_HAL_DATA(padapter);
pu8 = pInBuf;
-   pU1Tmp = pInBuf;
pU4Tmp = pInBuf;
ret = true;
 
@@ -614,11 +610,11 @@ static u8 halbtcoutsrc_Set(void *pBtcContext, u8 setType, 
void *pInBuf)
 
/*  set some u8 type variables. */
case BTC_SET_U1_RSSI_ADJ_VAL_FOR_AGC_TABLE_ON:
-   pBtCoexist->btInfo.rssiAdjustForAgcTableOn = *pU1Tmp;
+   pBtCoexist->btInfo.rssiAdjustForAgcTableOn = *pu8;
break;
 
case BTC_SET_U1_AGG_BUF_SIZE:
-   pBtCoexist->btInfo.aggBufSize = *pU1Tmp;
+   pBtCoexist->btInfo.aggBufSize = *pu8;
break;
 
/*  the following are some action which will be triggered */
@@ -633,15 +629,15 @@ static u8 halbtcoutsrc_Set(void *pBtcContext, u8 setType, 
void *pInBuf)
/* 1Ant === */
/*  set some u8 type variables. */
case BTC_SET_U1_RSSI_ADJ_VAL_FOR_1ANT_COEX_TYPE:
-   pBtCoexist->btInfo.rssiAdjustFor1AntCoexType = *pU1Tmp;
+   pBtCoexist->btInfo.rssiAdjustFor1AntCoexType = *pu8;
break;
 
case BTC_SET_U1_LPS_VAL:
-   pBtCoexist->btInfo.lpsVal = *pU1Tmp;
+   pBtCoexist->btInfo.lpsVal = *pu8;
break;
 
case BTC_SET_U1_RPWM_VAL:
-   pBtCoexist->btInfo.rpwmVal = *pU1Tmp;
+   pBtCoexist->btInfo.rpwmVal = *pu8;
break;
 
/*  the following are some action which will be triggered */
-- 
2.7.4

[PATCH v5 25/25] userfaultfd: selftests: add write-protect test

This patch adds uffd tests for write protection.

Instead of introducing new tests for it, let's simply squash the uffd-wp
tests into the existing uffd-missing test cases.  Changes are:

(1) Bouncing tests

  We do the write-protection in two ways during the bouncing test:

  - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
we'll make sure that for each bounce process every single page will
fault at least twice: once for MISSING, once for WP.

  - By directly calling UFFDIO_WRITEPROTECT on already faulted memory:
To further torture the explicit page protection procedures of
uffd-wp, we split each bounce procedure into two halves (in the
background thread): the first half will be MISSING+WP for each
page as explained above.  After the first half, we write protect
the faulted region in the background thread to make sure at least
half of the pages will be write protected again which is the first
half to test the new UFFDIO_WRITEPROTECT call.  Then we continue
with the 2nd half, which will contain both MISSING and WP faulting
tests for the 2nd half and WP-only faults from the 1st half.

(2) Event/Signal test

  Mostly the previous tests, but doing MISSING+WP for each page.  For
  the sigbus-mode test we'll need to provide a standalone path to
  handle the write protection faults.

For all tests, also collect statistics for uffd-wp pages.

Signed-off-by: Peter Xu 
---
 tools/testing/selftests/vm/userfaultfd.c | 157 +++
 1 file changed, 133 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 417dbdf4d379..fa362fe311e3 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../kselftest.h"
 
@@ -78,6 +79,8 @@ static int test_type;
 #define ALARM_INTERVAL_SECS 10
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
+/* Whether to test uffd write-protection */
+static bool test_uffdio_wp = false;
 
 static bool map_shared;
 static int huge_fd;
@@ -92,6 +95,7 @@ pthread_attr_t attr;
 struct uffd_stats {
int cpu;
unsigned long missing_faults;
+   unsigned long wp_faults;
 };
 
 /* pthread_mutex_t starts at page offset 0 */
@@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
for (i = 0; i < n_cpus; i++) {
uffd_stats[i].cpu = i;
uffd_stats[i].missing_faults = 0;
+   uffd_stats[i].wp_faults = 0;
}
 }
 
+static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
+{
+   int i;
+   unsigned long long miss_total = 0, wp_total = 0;
+
+   for (i = 0; i < n_cpus; i++) {
+   miss_total += stats[i].missing_faults;
+   wp_total += stats[i].wp_faults;
+   }
+
+   printf("userfaults: %llu missing (", miss_total);
+   for (i = 0; i < n_cpus; i++)
+   printf("%lu+", stats[i].missing_faults);
+   printf("\b), %llu wp (", wp_total);
+   for (i = 0; i < n_cpus; i++)
+   printf("%lu+", stats[i].wp_faults);
+   printf("\b)\n");
+}
+
 static int anon_release_pages(char *rel_area)
 {
int ret = 0;
@@ -264,10 +288,15 @@ struct uffd_test_ops {
void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
-#define ANON_EXPECTED_IOCTLS   ((1 << _UFFDIO_WAKE) | \
+#define SHMEM_EXPECTED_IOCTLS  ((1 << _UFFDIO_WAKE) | \
 (1 << _UFFDIO_COPY) | \
 (1 << _UFFDIO_ZEROPAGE))
 
+#define ANON_EXPECTED_IOCTLS   ((1 << _UFFDIO_WAKE) | \
+(1 << _UFFDIO_COPY) | \
+(1 << _UFFDIO_ZEROPAGE) | \
+(1 << _UFFDIO_WRITEPROTECT))
+
 static struct uffd_test_ops anon_uffd_test_ops = {
.expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area  = anon_allocate_area,
@@ -276,7 +305,7 @@ static struct uffd_test_ops anon_uffd_test_ops = {
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-   .expected_ioctls = ANON_EXPECTED_IOCTLS,
+   .expected_ioctls = SHMEM_EXPECTED_IOCTLS,
.allocate_area  = shmem_allocate_area,
.release_pages  = shmem_release_pages,
.alias_mapping = noop_alias_mapping,
@@ -300,6 +329,21 @@ static int my_bcmp(char *str1, char *str2, size_t n)
return 0;
 }
 
+static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
+{
+   struct uffdio_writeprotect prms = { 0 };
+
+   /* Write protection page faults */
+   prms.range.start = start;
+   prms.range.len = len;
+   /* Undo write-protect, do wakeup after that */
+   prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
+
+   if (ioctl(ufd,

[PATCH v5 23/25] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally

Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and all the checks passed.  This way, when the
user registers regions with shmem/hugetlbfs, we won't expose the new
ioctl to them.  Even for a fully anonymous memory range, we only
expose the new WP ioctl bit if the register mode has MODE_WP.
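
As a rough userspace-compilable illustration of the masking described
above (not the kernel code itself): the SK_* bit numbers and the
sk_ioctls_out() helper are hypothetical stand-ins, the real constants
live in include/uapi/linux/userfaultfd.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit numbers only; not the real _UFFDIO_* values. */
enum { SK_WAKE = 0, SK_COPY = 1, SK_ZEROPAGE = 2, SK_WRITEPROTECT = 3 };

#define SK_RANGE_IOCTLS_BASIC \
	((uint64_t)1 << SK_WAKE | (uint64_t)1 << SK_COPY)
#define SK_RANGE_IOCTLS \
	(SK_RANGE_IOCTLS_BASIC | (uint64_t)1 << SK_ZEROPAGE | \
	 (uint64_t)1 << SK_WRITEPROTECT)

/* Compute the ioctl bitmap reported to userspace after registration. */
static uint64_t sk_ioctls_out(bool basic_ioctls, bool mode_wp)
{
	uint64_t out = basic_ioctls ? SK_RANGE_IOCTLS_BASIC : SK_RANGE_IOCTLS;

	/* Hide the WP ioctl unless WP mode was requested at register time. */
	if (!mode_wp)
		out &= ~((uint64_t)1 << SK_WRITEPROTECT);

	return out;
}
```

With these stand-ins, a full-featured range registered without MODE_WP
reports every bit except the write-protect one, and a "basic" range
never reports it either way.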

Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 fs/userfaultfd.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 498971fa9163..4e1d7748224a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1465,14 +1465,24 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
up_write(&mm->mmap_sem);
mmput(mm);
if (!ret) {
+   __u64 ioctls_out;
+
+   ioctls_out = basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
+   UFFD_API_RANGE_IOCTLS;
+
+   /*
+* Declare the WP ioctl only if the WP mode is
+* specified and all checks passed with the range
+*/
+   if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
+   ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
+
/*
 * Now that we scanned all vmas we can already tell
 * userland which ioctls methods are guaranteed to
 * succeed on this range.
 */
-   if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
-UFFD_API_RANGE_IOCTLS,
-&user_uffdio_register->ioctls))
+   if (put_user(ioctls_out, &user_uffdio_register->ioctls))
ret = -EFAULT;
}
 out:
-- 
2.21.0

[PATCH v5 24/25] userfaultfd: selftests: refactor statistics

Introduce uffd_stats structure for statistics of the self test, at the
same time refactor the code to always pass in the uffd_stats for either
read() or poll() typed fault handling threads instead of using two
different ways to return the statistic results.  No functional change.

With the new structure, it's very easy to introduce new statistics.

Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 tools/testing/selftests/vm/userfaultfd.c | 76 +++-
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index b3e6497b080c..417dbdf4d379 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, 
*area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
 
+/* Userfaultfd test statistics */
+struct uffd_stats {
+   int cpu;
+   unsigned long missing_faults;
+};
+
 /* pthread_mutex_t starts at page offset 0 */
 #define area_mutex(___area, ___nr) \
((pthread_mutex_t *) ((___area) + (___nr)*page_size))
@@ -127,6 +133,17 @@ static void usage(void)
exit(1);
 }
 
+static void uffd_stats_reset(struct uffd_stats *uffd_stats,
+unsigned long n_cpus)
+{
+   int i;
+
+   for (i = 0; i < n_cpus; i++) {
+   uffd_stats[i].cpu = i;
+   uffd_stats[i].missing_faults = 0;
+   }
+}
+
 static int anon_release_pages(char *rel_area)
 {
int ret = 0;
@@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
return 0;
 }
 
-/* Return 1 if page fault handled by us; otherwise 0 */
-static int uffd_handle_page_fault(struct uffd_msg *msg)
+static void uffd_handle_page_fault(struct uffd_msg *msg,
+  struct uffd_stats *stats)
 {
unsigned long offset;
 
@@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
offset &= ~(page_size-1);
 
-   return copy_page(uffd, offset);
+   if (copy_page(uffd, offset))
+   stats->missing_faults++;
 }
 
 static void *uffd_poll_thread(void *arg)
 {
-   unsigned long cpu = (unsigned long) arg;
+   struct uffd_stats *stats = (struct uffd_stats *)arg;
+   unsigned long cpu = stats->cpu;
struct pollfd pollfd[2];
struct uffd_msg msg;
struct uffdio_register uffd_reg;
int ret;
char tmp_chr;
-   unsigned long userfaults = 0;
 
pollfd[0].fd = uffd;
pollfd[0].events = POLLIN;
@@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
msg.event), exit(1);
break;
case UFFD_EVENT_PAGEFAULT:
-   userfaults += uffd_handle_page_fault(&msg);
+   uffd_handle_page_fault(&msg, stats);
break;
case UFFD_EVENT_FORK:
close(uffd);
@@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
break;
}
}
-   return (void *)userfaults;
+
+   return NULL;
 }
 
 pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
 
 static void *uffd_read_thread(void *arg)
 {
-   unsigned long *this_cpu_userfaults;
+   struct uffd_stats *stats = (struct uffd_stats *)arg;
struct uffd_msg msg;
 
-   this_cpu_userfaults = (unsigned long *) arg;
-   *this_cpu_userfaults = 0;
-
pthread_mutex_unlock(&uffd_read_mutex);
/* from here cancellation is ok */
 
for (;;) {
if (uffd_read_msg(uffd, &msg))
continue;
-   (*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
+   uffd_handle_page_fault(&msg, stats);
}
-   return (void *)NULL;
+
+   return NULL;
 }
 
 static void *background_thread(void *arg)
@@ -582,13 +599,12 @@ static void *background_thread(void *arg)
return NULL;
 }
 
-static int stress(unsigned long *userfaults)
+static int stress(struct uffd_stats *uffd_stats)
 {
unsigned long cpu;
pthread_t locking_threads[nr_cpus];
pthread_t uffd_threads[nr_cpus];
pthread_t background_threads[nr_cpus];
-   void **_userfaults = (void **) userfaults;
 
finished = 0;
for (cpu = 0; cpu < nr_cpus; cpu++) {
@@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
return 1;
if (bounces & BOUNCE_POLL) {
if (pthread_create(&uffd_threads[cpu], &attr,
-  uffd_poll_thread, (void *)cpu))
+  uffd_poll_thread,
+  (void *)&uffd_stats[cpu]))
return

[PATCH v5 22/25] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update

From: Martin Cracauer 

Adds documentation about the write protection support.

Signed-off-by: Martin Cracauer 
Signed-off-by: Andrea Arcangeli 
[peterx: rewrite in rst format; fixups here and there]
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 Documentation/admin-guide/mm/userfaultfd.rst | 51 
 1 file changed, 51 insertions(+)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst 
b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf661a8a..c30176e67900 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that 
nothing can see an
 half copied page since it'll keep userfaulting until the copy has
 finished.
 
+Notes:
+
+- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
+  you must provide some kind of page in your thread after reading from
+  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
+  The normal behavior of the OS automatically providing a zero page on
+  an anonymous mapping is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+  registered with.  You must fill in all fields for the appropriate
+  ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+  event out of a struct uffd_msg that you read in the thread from the
+  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
+  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
+  the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including (pollfd[0].revents &
+  POLLERR).  This can happen, e.g. when ranges supplied were
+  incorrect.
+
+Write Protect Notifications
+---
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
+Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
+struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
+in the struct passed in.  The range does not default to and does not
+have to be identical to the range you registered with.  You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
+ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
+while uffdio_writeprotect.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
+This wakes up the thread which will continue to run with writes. This
+allows you to do the bookkeeping about the write in the uffd reading
+thread before the ioctl.
+
+If you registered with both UFFDIO_REGISTER_MODE_MISSING and
+UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+which you supply a page and undo write protect.  Note that there is a
+difference between writes into a WP area and into a !WP area.  The
+former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
+UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
+you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+used.
+
 QEMU/KVM
 
 
-- 
2.21.0

[PATCH v5 21/25] userfaultfd: wp: don't wake up when doing write protect

It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region.  Only wake up when resolving a write
protected page fault.
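
A minimal sketch of the resulting mode handling, written as a
standalone function so it can be compiled outside the kernel.  The
SK_MODE_* values and sk_wp_mode_check() are hypothetical stand-ins for
the UFFDIO_WRITEPROTECT_MODE_* flags and the checks in
userfaultfd_writeprotect().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for UFFDIO_WRITEPROTECT_MODE_WP / _MODE_DONTWAKE. */
#define SK_MODE_WP        ((uint64_t)1 << 0)
#define SK_MODE_DONTWAKE  ((uint64_t)1 << 1)

/*
 * Validate the mode and decide whether waiters should be woken.
 * Returns 0 and fills *wake, or -EINVAL for an unsupported mode.
 */
static int sk_wp_mode_check(uint64_t mode, bool *wake)
{
	bool mode_wp, mode_dontwake;

	if (mode & ~(SK_MODE_WP | SK_MODE_DONTWAKE))
		return -EINVAL;

	mode_wp = mode & SK_MODE_WP;
	mode_dontwake = mode & SK_MODE_DONTWAKE;

	/* Write-protecting never wakes, so DONTWAKE makes no sense here. */
	if (mode_wp && mode_dontwake)
		return -EINVAL;

	/* Only wake when resolving a fault and the caller did not opt out. */
	*wake = !mode_wp && !mode_dontwake;
	return 0;
}
```

So the only combination that wakes waiters is mode == 0, i.e. undoing
write protection without DONTWAKE.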

Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 fs/userfaultfd.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3cf19aeaa0e0..498971fa9163 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1782,6 +1782,7 @@ static int userfaultfd_writeprotect(struct 
userfaultfd_ctx *ctx,
struct uffdio_writeprotect uffdio_wp;
struct uffdio_writeprotect __user *user_uffdio_wp;
struct userfaultfd_wake_range range;
+   bool mode_wp, mode_dontwake;
 
if (READ_ONCE(ctx->mmap_changing))
return -EAGAIN;
@@ -1800,18 +1801,20 @@ static int userfaultfd_writeprotect(struct 
userfaultfd_ctx *ctx,
if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
   UFFDIO_WRITEPROTECT_MODE_WP))
return -EINVAL;
-   if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
-(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+
+   mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+   mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+
+   if (mode_wp && mode_dontwake)
return -EINVAL;
 
ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
- uffdio_wp.range.len, uffdio_wp.mode &
- UFFDIO_WRITEPROTECT_MODE_WP,
+ uffdio_wp.range.len, mode_wp,
  &ctx->mmap_changing);
if (ret)
return ret;
 
-   if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+   if (!mode_wp && !mode_dontwake) {
range.start = uffdio_wp.range.start;
range.len = uffdio_wp.range.len;
wake_userfault(ctx, &range);
-- 
2.21.0

[PATCH v5 20/25] userfaultfd: wp: enabled write protection in userfaultfd API

From: Shaohua Li 

Now it's safe to enable write protection in userfaultfd API

Cc: Andrea Arcangeli 
Cc: Pavel Emelyanov 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Signed-off-by: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 include/uapi/linux/userfaultfd.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 95c4a160e5f8..e7e98bde221f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |   \
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP |\
+  UFFD_FEATURE_EVENT_FORK |\
   UFFD_FEATURE_EVENT_REMAP |   \
   UFFD_FEATURE_EVENT_REMOVE |  \
   UFFD_FEATURE_EVENT_UNMAP |   \
@@ -34,7 +35,8 @@
 #define UFFD_API_RANGE_IOCTLS  \
((__u64)1 << _UFFDIO_WAKE | \
 (__u64)1 << _UFFDIO_COPY | \
-(__u64)1 << _UFFDIO_ZEROPAGE)
+(__u64)1 << _UFFDIO_ZEROPAGE | \
+(__u64)1 << _UFFDIO_WRITEPROTECT)
 #define UFFD_API_RANGE_IOCTLS_BASIC\
((__u64)1 << _UFFDIO_WAKE | \
 (__u64)1 << _UFFDIO_COPY)
-- 
2.21.0

[PATCH v5 18/25] userfaultfd: wp: support write protection for userfault vma range

From: Shaohua Li 

Add an API to enable/disable write protection on a vma range.  Unlike
mprotect, this doesn't split/merge vmas.
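
For readers who want to play with the parameter checks outside the
kernel, here is a small sketch of the sanity conditions that
mwriteprotect_range() asserts with BUG_ON() in the diff below.  It
returns a bool instead of crashing; SK_PAGE_SIZE and sk_wp_range_ok()
are assumptions made for the example (4K pages).

```c
#include <stdbool.h>

/* Assumed page size for the sketch; the kernel uses PAGE_MASK. */
#define SK_PAGE_SIZE 4096UL
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

/* True if [start, start + len) is page aligned, non-empty and does not wrap. */
static bool sk_wp_range_ok(unsigned long start, unsigned long len)
{
	if (start & ~SK_PAGE_MASK)
		return false;
	if (len & ~SK_PAGE_MASK)
		return false;

	/* Does the address range wrap, or is the span zero-sized? */
	if (start + len <= start)
		return false;

	return true;
}
```

Unsigned arithmetic wraps in C, which is what makes the single
"start + len <= start" comparison catch both the overflow and the
zero-length case.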

Cc: Andrea Arcangeli 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Signed-off-by: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
[peterx:
 - use the helper to find VMA;
 - return -ENOENT if not found to match mcopy case;
 - use the new MM_CP_UFFD_WP* flags for change_protection
 - check against mmap_changing for failures]
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 include/linux/userfaultfd_k.h |  3 ++
 mm/userfaultfd.c  | 54 +++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index dcd33172b728..a8e5f3ea9bb2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -41,6 +41,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
  unsigned long dst_start,
  unsigned long len,
  bool *mmap_changing);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+  unsigned long start, unsigned long len,
+  bool enable_wp, bool *mmap_changing);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6b9dd5b66f64..4208592c7ca3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -638,3 +638,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned 
long start,
 {
return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
 }
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+   unsigned long len, bool enable_wp, bool *mmap_changing)
+{
+   struct vm_area_struct *dst_vma;
+   pgprot_t newprot;
+   int err;
+
+   /*
+* Sanitize the command parameters:
+*/
+   BUG_ON(start & ~PAGE_MASK);
+   BUG_ON(len & ~PAGE_MASK);
+
+   /* Does the address range wrap, or is the span zero-sized? */
+   BUG_ON(start + len <= start);
+
+   down_read(&dst_mm->mmap_sem);
+
+   /*
+* If memory mappings are changing because of non-cooperative
+* operation (e.g. mremap) running in parallel, bail out and
+* request the user to retry later
+*/
+   err = -EAGAIN;
+   if (mmap_changing && READ_ONCE(*mmap_changing))
+   goto out_unlock;
+
+   err = -ENOENT;
+   dst_vma = vma_find_uffd(dst_mm, start, len);
+   /*
+* Make sure the vma is not shared, that the dst range is
+* both valid and fully within a single existing vma.
+*/
+   if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+   goto out_unlock;
+   if (!userfaultfd_wp(dst_vma))
+   goto out_unlock;
+   if (!vma_is_anonymous(dst_vma))
+   goto out_unlock;
+
+   if (enable_wp)
+   newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+   else
+   newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+   change_protection(dst_vma, start, start + len, newprot,
+ enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+
+   err = 0;
+out_unlock:
+   up_read(&dst_mm->mmap_sem);
+   return err;
+}
-- 
2.21.0

[PATCH v5 19/25] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl

From: Andrea Arcangeli 

v1: From: Shaohua Li 

v2: cleanups, remove a branch.

[peterx writes up the commit message, as below...]

This patch introduces the new uffd-wp APIs for userspace.

Firstly, we allow UFFDIO_REGISTER with write protection tracking using
the new UFFDIO_REGISTER_MODE_WP flag.  Note that this flag can co-exist
with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the
userspace program can not only resolve missing page faults but also
track page data changes along the way.

Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page
level write protection tracking.  Note that the memory region needs to
be registered with UFFDIO_REGISTER_MODE_WP before that.

Signed-off-by: Andrea Arcangeli 
[peterx: remove useless block, write commit message, check against
 VM_MAYWRITE rather than VM_WRITE when register]
Reviewed-by: Jerome Glisse 
Signed-off-by: Peter Xu 
---
 fs/userfaultfd.c | 82 +---
 include/uapi/linux/userfaultfd.h | 23 +
 2 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c594945ad5bf..3cf19aeaa0e0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -306,8 +306,11 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
if (!pmd_present(_pmd))
goto out;
 
-   if (pmd_trans_huge(_pmd))
+   if (pmd_trans_huge(_pmd)) {
+   if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+   ret = true;
goto out;
+   }
 
/*
 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
@@ -320,6 +323,8 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
 */
if (pte_none(*pte))
ret = true;
+   if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+   ret = true;
pte_unmap(pte);
 
 out:
@@ -1258,10 +1263,13 @@ static __always_inline int validate_range(struct 
mm_struct *mm,
return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma)
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+unsigned long vm_flags)
 {
-   return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-   vma_is_shmem(vma);
+   /* FIXME: add WP support to hugetlbfs and shmem */
+   return vma_is_anonymous(vma) ||
+   ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
+!(vm_flags & VM_UFFD_WP));
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1293,15 +1301,8 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
vm_flags = 0;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
-   if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+   if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
vm_flags |= VM_UFFD_WP;
-   /*
-* FIXME: remove the below error constraint by
-* implementing the wprotect tracking mode.
-*/
-   ret = -EINVAL;
-   goto out;
-   }
 
ret = validate_range(mm, uffdio_register.range.start,
 uffdio_register.range.len);
@@ -1351,7 +1352,7 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
 
/* check not compatible vmas */
ret = -EINVAL;
-   if (!vma_can_userfault(cur))
+   if (!vma_can_userfault(cur, vm_flags))
goto out_unlock;
 
/*
@@ -1379,6 +1380,8 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
if (end & (vma_hpagesize - 1))
goto out_unlock;
}
+   if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
+   goto out_unlock;
 
/*
 * Check that this vma isn't already owned by a
@@ -1408,7 +1411,7 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
do {
cond_resched();
 
-   BUG_ON(!vma_can_userfault(vma));
+   BUG_ON(!vma_can_userfault(vma, vm_flags));
BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
   vma->vm_userfaultfd_ctx.ctx != ctx);
WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1545,7 +1548,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx 
*ctx,
 * provides for more strict behavior to notice
 * unregistration errors.
 */
-   if (!vma_can_userfault(cur))
+   if (!vma_can_userfault(cur, cur->vm_flags))
goto out_unlock;
 
found = true;
@@ -1559,7 +1562,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx

Re: [PATCH v1 12/22] docs: driver-api: add .rst files from the main dir

2019-06-19 Thread Mauro Carvalho Chehab

Em Wed, 19 Jun 2019 23:27:53 +0200
Peter Zijlstra  escreveu:

> On Wed, Jun 19, 2019 at 10:19:22AM -0300, Mauro Carvalho Chehab wrote:
> > (c/c list cleaned)
> > 
> > Em Wed, 19 Jun 2019 13:43:56 +0200
> > Peter Zijlstra  escreveu:
> >   
> > > On Tue, Jun 18, 2019 at 05:53:17PM -0300, Mauro Carvalho Chehab wrote:
> > >   
> > > >  .../{ => driver-api}/atomic_bitops.rst|  2 -
> > > 
> > > That's a .txt file, big fat NAK for making it an rst.  
> > 
> > Rst is a text file. This one is parsed properly by Sphinx without
> > any changes.  
> 
> In my tree it is a .txt file, I've not seen patches changing it. And I
> disagree, rst is just as much 'a text file' as .c is.

ReStructured text is just text with a stricter style + some commands,
if the text author wants to enhance it.

Btw, I'm glad you mentioned c. 

This is c:

int
func( int a, int
 b ) {
 return a + b;
}

This is also c:

func(int a,int b) { goto foo;
foo:
   return(a+b) }

K&R style is also c, and this is also c:

#define f(a,b) (a+b)

Although none of the above match my taste - and some have issues - they
all build with gcc.

Yet, none of the above follows the Kernel coding style.

The way we use ReST (with absolute minimal changes), it becomes just
a text style.

Btw, I agree with you: there are some odd things in its style - and we
should work to reduce them to a minimum.

> 
> > > >  .../{ => driver-api}/futex-requeue-pi.rst |  2 -
> > >   
> > > >  .../{ => driver-api}/gcc-plugins.rst  |  2 -
> > >   
> > > >  Documentation/{ => driver-api}/kprobes.rst|  2 -
> > > >  .../{ => driver-api}/percpu-rw-semaphore.rst  |  2 -
> > > 
> > > More NAK for rst conversion  
> > 
> > Again, those don't need any conversion. Those files already parse 
> > as-is by Sphinx, with no need for any change.  
> 
> And yet, they're a .txt file in my tree. And I've not seen a rename,
> just this move.

Rename is on patch 1/22.

No matter the extension, all the above files pass at the Sphinx style
validation without warnings or errors. Patch 1/22 doesn't make any
conversion.

Btw, the .rst extension is just a convenient way to help identify what
was not validated. If I'm not mistaken, when the discussions about a
replacement for DocBook started at linux-doc, someone proposed to
keep the .txt extension (changing it to accept .rst, .txt or both is
a single line change at conf.py).

> 
> > The only change here is that, on patch 1/22, the files that
> > aren't listed on an index file got a :orphan: added in order
> > to make this explicit. This patch removes it.  
> 
> I've no idea what :orphan: is. Text file don't have markup.
> 
> > > >  Documentation/{ => driver-api}/pi-futex.rst   |  2 -
> > > >  .../{ => driver-api}/preempt-locking.rst  |  2 -
> > >   
> > > >  Documentation/{ => driver-api}/rbtree.rst |  2 -
> > >   
> > > >  .../{ => driver-api}/robust-futex-ABI.rst |  2 -
> > > >  .../{ => driver-api}/robust-futexes.rst   |  2 -
> > >   
> > > >  .../{ => driver-api}/speculation.rst  |  8 +--
> > > >  .../{ => driver-api}/static-keys.rst  |  2 -
> > >   
> > > >  .../{ => driver-api}/this_cpu_ops.rst |  2 -
> > >   
> > > >  Documentation/locking/rt-mutex.rst|  2 +-
> > > 
> > > NAK. None of the above have anything to do with driver-api.  
> > 
> > Ok. Where do you think they should sit instead? core-api?  
> 
> Pretty much all of then are core-api I tihnk, with exception of the one
> that are ABI, which have nothing to do with API. 

OK.

> And i've no idea where
> GCC plugins go, but it's definitely nothing to do with drivers.

I suspect that Documentation/security would be a better place
for GCC plugins (as it has been discussed at kernel-hardening ML),
but I'm waiting a feedback from Kees.

> 
> Many of the futex ones are about the sys_futex user API, which
> apparently we have Documentation/userspace-api/ for.

Yeah, it makes sense to place sys_futex there.

Despite being an old dir, it is not too popular: there are
very few documents there. I only discovered this one a few
days ago.

> 
> Why are you doing this if you've no clue what they're on about?

I don't pretend to know precisely where each document will fit.
If you read carefully the content of each orphaned document, you'll see
that many of them have uAPI, kAPI and admin-guide info inside.

To be frank, I actually tried to get rid of this document shift
part, but Jon's feedback when I submitted a much simpler RFC
patchset challenged me to try to place each document somewhere. The
renaming part is by far a lot more complex than the conversion,
because depending on how you interpret the file contents -
and the description of each documentation chapter - it may fit in a
different subdir.

-

My main goal is to have an organized body with the documentation. 

Try to read

[PATCH v5 16/25] khugepaged: skip collapse if uffd-wp detected

Don't collapse the huge PMD if any of the small PTEs is userfault
write protected.  The problem is that the write protection is at small
page granularity, and there's no way to keep all this write protection
information if the small pages are going to be merged into a huge PMD.

The same thing needs to be considered for swap entries and migration
entries.  So do the check for those as well, regardless of
khugepaged_max_ptes_swap.
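
To make the rule concrete, here is a toy model of the scan loop,
assuming a flat array of simplified PTE descriptors.  struct
sk_scan_pte, the SK_SCAN_* results and sk_scan_pmd() are all invented
for this sketch; only the ordering of the checks is meant to mirror the
diff below.

```c
#include <stdbool.h>
#include <stddef.h>

enum sk_scan_result {
	SK_SCAN_SUCCEED,
	SK_SCAN_EXCEED_SWAP_PTE,
	SK_SCAN_PTE_UFFD_WP,
};

/* Simplified PTE: is it a swap entry, and does it carry the uffd-wp bit? */
struct sk_scan_pte {
	bool is_swap;
	bool uffd_wp;
};

/* Scan the small PTEs under one PMD and decide whether collapse may go on. */
static enum sk_scan_result sk_scan_pmd(const struct sk_scan_pte *ptes,
				       size_t n, int max_ptes_swap)
{
	int unmapped = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ptes[i].is_swap) {
			if (++unmapped > max_ptes_swap)
				return SK_SCAN_EXCEED_SWAP_PTE;
			/* Within the swap budget, but uffd-wp still vetoes. */
			if (ptes[i].uffd_wp)
				return SK_SCAN_PTE_UFFD_WP;
			continue;
		}
		/* A present PTE armed with uffd-wp also vetoes the collapse. */
		if (ptes[i].uffd_wp)
			return SK_SCAN_PTE_UFFD_WP;
	}

	return SK_SCAN_SUCCEED;
}
```

Even with a generous swap budget, a single uffd-wp entry is enough to
abort the collapse in this model.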

Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c| 23 +++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/huge_memory.h 
b/include/trace/events/huge_memory.h
index dd4db334bd63..2d7bad9cb976 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
EM( SCAN_PMD_NULL,  "pmd_null") \
EM( SCAN_EXCEED_NONE_PTE,   "exceed_none_pte")  \
EM( SCAN_PTE_NON_PRESENT,   "pte_non_present")  \
+   EM( SCAN_PTE_UFFD_WP,   "pte_uffd_wp")  \
EM( SCAN_PAGE_RO,   "no_writable_page") \
EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null")\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0f7419938008..fc40aa214be7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
SCAN_PMD_NULL,
SCAN_EXCEED_NONE_PTE,
SCAN_PTE_NON_PRESENT,
+   SCAN_PTE_UFFD_WP,
SCAN_PAGE_RO,
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
@@ -1128,6 +1129,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
if (++unmapped <= khugepaged_max_ptes_swap) {
+   /*
+* Always be strict with uffd-wp
+* enabled swap entries.  Please see
+* comment below for pte_uffd_wp().
+*/
+   if (pte_swp_uffd_wp(pteval)) {
+   result = SCAN_PTE_UFFD_WP;
+   goto out_unmap;
+   }
continue;
} else {
result = SCAN_EXCEED_SWAP_PTE;
@@ -1147,6 +1157,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
result = SCAN_PTE_NON_PRESENT;
goto out_unmap;
}
+   if (pte_uffd_wp(pteval)) {
+   /*
+* Don't collapse the page if any of the small
+* PTEs are armed with uffd write protection.
+* Here we can also mark the new huge pmd as
+* write protected if any of the small ones is
+* marked but that could bring unknown
+* userfault messages that fall outside of
+* the registered range.  So, just be simple.
+*/
+   result = SCAN_PTE_UFFD_WP;
+   goto out_unmap;
+   }
if (pte_write(pteval))
writable = true;
 
-- 
2.21.0

[PATCH v5 17/25] userfaultfd: introduce helper vma_find_uffd

We have multiple (and more coming) places that would like to find a
userfault enabled VMA from a mm struct that covers a specific memory
range.  This patch introduces a helper for it and applies it to the
existing code.
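
The lookup semantics can be tried out in userspace with a toy VMA
list.  This is only a sketch: struct sk_vma, sk_find_vma() and
sk_vma_find_uffd() are made up for the example, with sk_find_vma()
assumed to behave like find_vma() (first VMA whose end lies above the
address) over a sorted, non-overlapping array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy VMA: [vm_start, vm_end) plus whether a uffd context is attached. */
struct sk_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	bool uffd_registered;
};

/* Like find_vma(): first VMA with vm_end > addr, or NULL. */
static const struct sk_vma *sk_find_vma(const struct sk_vma *vmas, size_t n,
					unsigned long addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* Return the uffd-registered VMA that fully covers the range, or NULL. */
static const struct sk_vma *sk_vma_find_uffd(const struct sk_vma *vmas,
					     size_t n, unsigned long start,
					     unsigned long len)
{
	const struct sk_vma *vma = sk_find_vma(vmas, n, start);

	if (!vma)
		return NULL;
	if (!vma->uffd_registered)
		return NULL;
	if (start < vma->vm_start || start + len > vma->vm_end)
		return NULL;

	return vma;
}
```

Note that a range starting in a hole before a VMA is rejected by the
vm_start comparison, since the find_vma()-style lookup alone would
still return the following VMA.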

Suggested-by: Mike Rapoport 
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 mm/userfaultfd.c | 54 +++-
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5363376cb07a..6b9dd5b66f64 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,34 @@
 #include 
 #include "internal.h"
 
+/*
+ * Find a valid userfault enabled VMA region that covers the whole
+ * address range, or NULL on failure.  Must be called with mmap_sem
+ * held.
+ */
+static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
+   unsigned long start,
+   unsigned long len)
+{
+   struct vm_area_struct *vma = find_vma(mm, start);
+
+   if (!vma)
+   return NULL;
+
+   /*
+* Check the vma is registered in uffd, this is required to
+* enforce the VM_MAYWRITE check done at uffd registration
+* time.
+*/
+   if (!vma->vm_userfaultfd_ctx.ctx)
+   return NULL;
+
+   if (start < vma->vm_start || start + len > vma->vm_end)
+   return NULL;
+
+   return vma;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
@@ -228,20 +256,9 @@ static __always_inline ssize_t 
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 */
if (!dst_vma) {
err = -ENOENT;
-   dst_vma = find_vma(dst_mm, dst_start);
+   dst_vma = vma_find_uffd(dst_mm, dst_start, len);
if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
goto out_unlock;
-   /*
-* Check the vma is registered in uffd, this is
-* required to enforce the VM_MAYWRITE check done at
-* uffd registration time.
-*/
-   if (!dst_vma->vm_userfaultfd_ctx.ctx)
-   goto out_unlock;
-
-   if (dst_start < dst_vma->vm_start ||
-   dst_start + len > dst_vma->vm_end)
-   goto out_unlock;
 
err = -EINVAL;
if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
@@ -487,20 +504,9 @@ static __always_inline ssize_t __mcopy_atomic(struct 
mm_struct *dst_mm,
 * both valid and fully within a single existing vma.
 */
err = -ENOENT;
-   dst_vma = find_vma(dst_mm, dst_start);
+   dst_vma = vma_find_uffd(dst_mm, dst_start, len);
if (!dst_vma)
goto out_unlock;
-   /*
-* Check the vma is registered in uffd, this is required to
-* enforce the VM_MAYWRITE check done at uffd registration
-* time.
-*/
-   if (!dst_vma->vm_userfaultfd_ctx.ctx)
-   goto out_unlock;
-
-   if (dst_start < dst_vma->vm_start ||
-   dst_start + len > dst_vma->vm_end)
-   goto out_unlock;
 
err = -EINVAL;
/*
-- 
2.21.0

[PATCH v5 12/25] userfaultfd: wp: apply _PAGE_UFFD_WP bit

Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp and make sure the two new
flags are exclusively used.  Then,

  - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
when a range of memory is write protected by uffd

  - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
_PAGE_RW when write protection is resolved from userspace

And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.

Do this change for both PTEs and huge PMDs.  Then we can start to
identify which PTE/PMD is write protected by general means (e.g., COW
or soft dirty tracking), and which is write protected for userfaultfd-wp.

Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().

Once _PAGE_UFFD_WP is in place, a special case is a page that is
protected both by the general COW logic and by userfault-wp.  Here
userfault-wp has higher priority and is handled first.  Only after the
uffd-wp bit is cleared on the PTE/PMD do we continue to handle the
general COW.  These are the steps that happen with such a page:

  1. CPU accesses write protected shared page (so both protected by
 general COW and uffd-wp), blocked by uffd-wp first because in
 do_wp_page we'll handle uffd-wp first, so it has higher priority
 than general COW.

  2. Uffd service thread receives the request and does
 UFFDIO_WRITEPROTECT to remove the uffd-wp bit from the PTE/PMD.
 However, here we still keep the write bit cleared.  Notify the
 blocked CPU.

  3. The blocked CPU resumes the page fault process with a fault
 retry.  During the retry it'll notice the uffd-wp bit is not set
 this time but the page is still write protected by general COW, so
 it'll go through the COW path in the fault handler, copy the page,
 apply the write bit where necessary, and retry again.

  4. The CPU will be able to access this page with write bit set.
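
The steps above can be sketched in userspace with a toy PTE model.  This
is only an illustration, not kernel code: the TOY_* bit positions, the
toy_pte_t type and the toy_* helpers are made-up names; only the
MM_CP_UFFD_WP* values are taken from this patch.  Following the steps
(and the comment in change_huge_pmd()), resolving clears the uffd-wp bit
and leaves the write bit for the fault handler.

```c
#include <assert.h>
#include <stdint.h>

/* Toy bit layout; illustrative values, not the real x86 definitions. */
#define TOY_PAGE_RW       (1ull << 1)
#define TOY_PAGE_UFFD_WP  (1ull << 10)

/* Flag values as defined by this patch. */
#define MM_CP_UFFD_WP          (1UL << 2)
#define MM_CP_UFFD_WP_RESOLVE  (1UL << 3)
#define MM_CP_UFFD_WP_ALL      (MM_CP_UFFD_WP | MM_CP_UFFD_WP_RESOLVE)

typedef uint64_t toy_pte_t;

enum toy_fault { TOY_FAULT_NONE, TOY_FAULT_UFFD_WP, TOY_FAULT_COW };

/* Apply or resolve uffd write protection on one toy PTE. */
static toy_pte_t toy_change_protection(toy_pte_t pte, unsigned long cp_flags)
{
	/* The two uffd-wp flags must never be passed together. */
	assert((cp_flags & MM_CP_UFFD_WP_ALL) != MM_CP_UFFD_WP_ALL);

	if (cp_flags & MM_CP_UFFD_WP) {
		pte &= ~TOY_PAGE_RW;		/* drop the write bit */
		pte |= TOY_PAGE_UFFD_WP;	/* mark as uffd write protected */
	} else if (cp_flags & MM_CP_UFFD_WP_RESOLVE) {
		/* Write bit stays cleared; the fault path handles COW. */
		pte &= ~TOY_PAGE_UFFD_WP;
	}
	return pte;
}

/* Decide which path a write fault takes: uffd-wp is checked first. */
static enum toy_fault toy_write_fault(toy_pte_t pte)
{
	if (pte & TOY_PAGE_UFFD_WP)
		return TOY_FAULT_UFFD_WP;
	if (!(pte & TOY_PAGE_RW))
		return TOY_FAULT_COW;
	return TOY_FAULT_NONE;
}
```

In this model a protected PTE faults into the uffd-wp path first, and
after MM_CP_UFFD_WP_RESOLVE the retried fault falls through to the COW
path, which matches steps 1 to 3.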

Suggested-by: Andrea Arcangeli 
Signed-off-by: Peter Xu 
---
 include/linux/mm.h |  5 +
 mm/huge_memory.c   | 18 +-
 mm/memory.c|  4 ++--
 mm/mprotect.c  | 17 +
 mm/userfaultfd.c   |  8 ++--
 5 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a93ac1c37940..beca76650271 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,11 @@ extern unsigned long move_page_tables(struct 
vm_area_struct *vma,
 #define  MM_CP_DIRTY_ACCT  (1UL << 0)
 /* Whether this protection change is for NUMA hints */
 #define  MM_CP_PROT_NUMA   (1UL << 1)
+/* Whether this change is for write protecting */
+#define  MM_CP_UFFD_WP (1UL << 2) /* do wp */
+#define  MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
+#define  MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
+   MM_CP_UFFD_WP_RESOLVE)
 
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned 
long start,
  unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7149a0acac1..3fda79f6746b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1911,6 +1911,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
bool preserve_write;
int ret;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+   bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+   bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
ptl = __pmd_trans_huge_lock(pmd, vma);
if (!ptl)
@@ -1977,6 +1979,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
+   if (uffd_wp) {
+   entry = pmd_wrprotect(entry);
+   entry = pmd_mkuffd_wp(entry);
+   } else if (uffd_wp_resolve) {
+   /*
+* Leave the write bit to be handled by PF interrupt
+* handler, then things like COW could be properly
+* handled.
+*/
+   entry = pmd_clear_uffd_wp(entry);
+   }
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
@@ -2125,7 +2138,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
-   bool young, write, soft_dirty, pmd_migration = false;
+   bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
unsigned long addr;
int i;
 
@@ -2207,6 +2220,7 @@ static void

[PATCH v5 11/25] mm: merge parameters for change_protection()

change_protection() is used by either the NUMA or the mprotect() code,
and there is one parameter for each of the callers (dirty_accountable
and prot_numa).  Further, these parameters are passed along the calls:

  - change_protection_range()
  - change_p4d_range()
  - change_pud_range()
  - change_pmd_range()
  - ...

Now we introduce a flag for change_protection() and all these helpers
to replace these parameters.  Then we can avoid passing multiple
parameters multiple times along the way.

More importantly, it'll greatly simplify the work if we want to
introduce any new parameters to change_protection().  In the follow up
patches, a new parameter for userfaultfd write protection will be
introduced.

No functional change at all.
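
As a rough userspace illustration of the idea (not the kernel code
itself), the two former int parameters become bits in a single
cp_flags word that each helper decodes locally.  The struct cp_opts
type and cp_decode() helper are hypothetical names used only for this
sketch; the MM_CP_* values are the ones added by this patch.

```c
#include <stdbool.h>

/* Flag values as defined by this patch. */
#define MM_CP_DIRTY_ACCT  (1UL << 0)
#define MM_CP_PROT_NUMA   (1UL << 1)

/* Hypothetical container for the decoded options. */
struct cp_opts {
	bool dirty_accountable;
	bool prot_numa;
};

/* Decode the bitmap into the two booleans the old signature carried. */
static struct cp_opts cp_decode(unsigned long cp_flags)
{
	struct cp_opts opts = {
		.dirty_accountable = (cp_flags & MM_CP_DIRTY_ACCT) != 0,
		.prot_numa = (cp_flags & MM_CP_PROT_NUMA) != 0,
	};
	return opts;
}
```

A caller that used to pass (0, 1) now passes MM_CP_PROT_NUMA, and
adding a new option only needs a new bit, not a new argument in every
helper along the call chain.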

Reviewed-by: Jerome Glisse 
Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h |  2 +-
 include/linux/mm.h  | 14 +-
 mm/huge_memory.c|  3 ++-
 mm/mempolicy.c  |  2 +-
 mm/mprotect.c   | 29 -
 5 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..a81a6ed609ac 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
-   int prot_numa);
+   unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dcaca899e4a8..a93ac1c37940 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1708,9 +1708,21 @@ extern unsigned long move_page_tables(struct 
vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
bool need_rmap_locks);
+
+/*
+ * Flags used by change_protection().  For now we make it a bitmap so
+ * that we can pass in multiple flags just like parameters.  However
+ * for now all the callers are only use one of the flags at the same
+ * time.
+ */
+/* Whether we should allow dirty bit accounting */
+#define  MM_CP_DIRTY_ACCT  (1UL << 0)
+/* Whether this protection change is for NUMA hints */
+#define  MM_CP_PROT_NUMA   (1UL << 1)
+
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned 
long start,
  unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa);
+ unsigned long cp_flags);
 extern int mprotect_fixup(struct vm_area_struct *vma,
  struct vm_area_struct **pprev, unsigned long start,
  unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..b7149a0acac1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1903,13 +1903,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned 
long old_addr,
  *  - HPAGE_PMD_NR is protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot, int prot_numa)
+   unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
pmd_t entry;
bool preserve_write;
int ret;
+   bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 
ptl = __pmd_trans_huge_lock(pmd, vma);
if (!ptl)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01600d80ae01..dea6a49573e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -575,7 +575,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
int nr_updated;
 
-   nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
+   nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
if (nr_updated)
count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bf38dfbbb4b4..ae9caa4c6562 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,12 +37,14 @@
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
-   int dirty_accountable, int prot_numa)
+   unsigned long cp_flags)
 {
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
int target_node = NUMA_NO_NODE;
+   bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
+   bool prot_numa = cp_flags &

[PATCH v5 13/25] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork

UFFD_EVENT_FORK support for uffd-wp should already be there, except
that we should clear the uffd-wp bit if the uffd fork event is not
enabled.  Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
is not being tracked by VM_UFFD_WP.  Do this for both small PTEs and
huge PMDs.
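
A minimal userspace sketch of that rule, assuming a toy PTE type and
made-up bit values (TOY_PAGE_UFFD_WP, TOY_VM_UFFD_WP and
toy_copy_pte_on_fork() are illustrative names, not kernel symbols):

```c
#include <stdint.h>

typedef uint64_t toy_pte_t;

/* Illustrative bit positions only. */
#define TOY_PAGE_UFFD_WP  (1ull << 10)	/* per-PTE uffd-wp marker */
#define TOY_VM_UFFD_WP    (1ul << 12)	/* per-VMA "tracked by uffd-wp" */

/*
 * Copy one PTE into the child.  If the child VMA is not tracked by
 * uffd-wp (the fork event was not enabled), the marker is dropped so
 * that it never outlives the VMA flag.
 */
static toy_pte_t toy_copy_pte_on_fork(toy_pte_t src_pte,
				      unsigned long child_vm_flags)
{
	toy_pte_t pte = src_pte;

	if (!(child_vm_flags & TOY_VM_UFFD_WP))
		pte &= ~TOY_PAGE_UFFD_WP;
	return pte;
}
```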

Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 mm/huge_memory.c | 8 
 mm/memory.c  | 8 
 2 files changed, 16 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3fda79f6746b..757975920df8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -980,6 +980,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
 
+   /*
+* Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+* does not have the VM_UFFD_WP, which means that the uffd
+* fork event is not enabled.
+*/
+   if (!(vma->vm_flags & VM_UFFD_WP))
+   pmd = pmd_clear_uffd_wp(pmd);
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
diff --git a/mm/memory.c b/mm/memory.c
index d79e6d1f8c62..8c69257d6ef1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -790,6 +790,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
 
+   /*
+* Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+* does not have the VM_UFFD_WP, which means that the uffd
+* fork event is not enabled.
+*/
+   if (!(vm_flags & VM_UFFD_WP))
+   pte = pte_clear_uffd_wp(pte);
+
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
-- 
2.21.0

[PATCH v5 14/25] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers

Add the missing helpers for uffd-wp operations on pmd swap/migration
entries.

Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 arch/x86/include/asm/pgtable.h | 15 +++
 include/asm-generic/pgtable_uffd.h | 15 +++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5b254b851082..0120fa671914 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1421,6 +1421,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+   return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+   return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+   return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define PKRU_AD_BIT 0x1
diff --git a/include/asm-generic/pgtable_uffd.h 
b/include/asm-generic/pgtable_uffd.h
index 643d1bf559c2..828966d4c281 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
return pte;
 }
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+   return pmd;
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+   return 0;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+   return pmd;
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
-- 
2.21.0

[PATCH v5 15/25] userfaultfd: wp: support swap and page migration

For both swap and page migration, we use bit 2 of the entry to identify
whether the entry is uffd write-protected.  It plays a similar role to
the existing soft dirty bit in swap entries, but only for keeping the
uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we recover the uffd-wp bit from a
swap/migration entry to the PTE bit, we also need to take care of the
_PAGE_RW bit and make sure it's cleared; otherwise, even with the
_PAGE_UFFD_WP bit set, we can't trap the write at all.

In change_pte_range() we do nothing for uffd if the PTE is a swap
entry.  That can lead to data mismatch if the page that we are going
to write protect is swapped out when sending the UFFDIO_WRITEPROTECT.
This patch also applies/removes the uffd-wp bit even for the swap
entries.
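
The recovery rule can be sketched in userspace as follows.  This is a
toy model under stated assumptions: toy_pte_t, toy_swap_in() and the
TOY_PAGE_* bit positions are made up for illustration; only "bit 2 of
the swap entry" comes from the description above.

```c
#include <stdint.h>

typedef uint64_t toy_pte_t;

/* Illustrative bit positions for a present PTE. */
#define TOY_PAGE_RW       (1ull << 1)
#define TOY_PAGE_UFFD_WP  (1ull << 10)
/* Bit 2 of the swap/migration entry carries the uffd-wp state. */
#define TOY_SWP_UFFD_WP   (1ull << 2)

/*
 * Build the present PTE when a swap entry is faulted back in.  If the
 * swap entry was uffd write protected, restore the marker and clear
 * the write bit, otherwise a write would not trap.
 */
static toy_pte_t toy_swap_in(toy_pte_t swp_pte, toy_pte_t new_pte)
{
	if (swp_pte & TOY_SWP_UFFD_WP) {
		new_pte |= TOY_PAGE_UFFD_WP;
		new_pte &= ~TOY_PAGE_RW;
	}
	return new_pte;
}
```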

Signed-off-by: Peter Xu 
---
 include/linux/swapops.h |  2 ++
 mm/huge_memory.c|  3 +++
 mm/memory.c |  8 
 mm/migrate.c|  6 ++
 mm/mprotect.c   | 28 +---
 mm/rmap.c   |  6 ++
 6 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..0c2923b1cdb7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
 
if (pte_swp_soft_dirty(pte))
pte = pte_swp_clear_soft_dirty(pte);
+   if (pte_swp_uffd_wp(pte))
+   pte = pte_swp_clear_uffd_wp(pte);
arch_entry = __pte_to_swp_entry(pte);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 757975920df8..eae25c58db9d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2221,6 +2221,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
write = is_write_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
+   uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
page = pmd_page(old_pmd);
if (pmd_dirty(old_pmd))
@@ -2253,6 +2254,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
+   if (uffd_wp)
+   entry = pte_swp_mkuffd_wp(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
entry = maybe_mkwrite(entry, vma);
diff --git a/mm/memory.c b/mm/memory.c
index 8c69257d6ef1..28e9342d00cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -738,6 +738,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*src_pte))
pte = pte_swp_mksoft_dirty(pte);
+   if (pte_swp_uffd_wp(*src_pte))
+   pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (is_device_private_entry(entry)) {
@@ -767,6 +769,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
is_cow_mapping(vm_flags)) {
make_device_private_entry_read();
pte = swp_entry_to_pte(entry);
+   if (pte_swp_uffd_wp(*src_pte))
+   pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
}
@@ -2930,6 +2934,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
+   if (pte_swp_uffd_wp(vmf->orig_pte)) {
+   pte = pte_mkuffd_wp(pte);
+   pte = pte_wrprotect(pte);
+   }
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
vmf->orig_pte = pte;
diff --git a/mm/migrate.c b/mm/migrate.c
index f2ecc2855a12..d8f1f6d13960 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -241,11 +241,15 @@ static bool remove_migration_pte(struct page *page, 
struct vm_area_struct *vma,
entry = pte_to_swp_entry(*pvmw.pte);
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
+   else if (pte_swp_uffd_wp(*pvmw.pte))
+   pte = pte_mkuffd_wp(pte);
 
if (unlikely(is_zone_device_page(new))) {

[PATCH v5 06/25] userfaultfd: wp: add helper for writeprotect check

From: Shaohua Li 

Add a helper for the writeprotect check.  It will be used later.

Cc: Andrea Arcangeli 
Cc: Pavel Emelyanov 
Cc: Rik van Riel 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Signed-off-by: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 include/linux/userfaultfd_k.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac9d71e24b81..5dc247af0f2e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct 
*vma)
return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -96,6 +101,11 @@ static inline bool userfaultfd_missing(struct 
vm_area_struct *vma)
return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return false;
-- 
2.21.0

[PATCH v5 10/25] userfaultfd: wp: add UFFDIO_COPY_MODE_WP

From: Andrea Arcangeli 

This allows UFFDIO_COPY to map pages write-protected.
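
A small userspace sketch of the two checks this implies, assuming the
mode bit values from the uapi hunk below.  toy_check_copy_mode() and
toy_copy_maps_writable() are hypothetical helpers for illustration, not
functions from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Mode bits as defined in the uapi header hunk of this patch. */
#define UFFDIO_COPY_MODE_DONTWAKE  ((uint64_t)1 << 0)
#define UFFDIO_COPY_MODE_WP        ((uint64_t)1 << 1)

/* Reject any mode bit other than the two known ones. */
static int toy_check_copy_mode(uint64_t mode)
{
	if (mode & ~(UFFDIO_COPY_MODE_DONTWAKE | UFFDIO_COPY_MODE_WP))
		return -EINVAL;
	return 0;
}

/* The new page is mapped writable only if the VMA allows it and WP is off. */
static bool toy_copy_maps_writable(bool vma_writable, uint64_t mode)
{
	return vma_writable && !(mode & UFFDIO_COPY_MODE_WP);
}
```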

Signed-off-by: Andrea Arcangeli 
[peterx: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
 around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
 commit messages]
Reviewed-by: Jerome Glisse 
Reviewed-by: Mike Rapoport 
Signed-off-by: Peter Xu 
---
 fs/userfaultfd.c |  5 +++--
 include/linux/userfaultfd_k.h|  2 +-
 include/uapi/linux/userfaultfd.h | 11 +-
 mm/userfaultfd.c | 36 ++--
 4 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5dbef45ecbf5..c594945ad5bf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1694,11 +1694,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
goto out;
-   if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+   if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
goto out;
if (mmget_not_zero(ctx->mm)) {
ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
-  uffdio_copy.len, &ctx->mmap_changing);
+  uffdio_copy.len, &ctx->mmap_changing,
+  uffdio_copy.mode);
mmput(ctx->mm);
} else {
return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7b91b76aac58..dcd33172b728 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,7 +36,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, 
unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
-   bool *mmap_changing);
+   bool *mmap_changing, __u64 mode);
 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
  unsigned long dst_start,
  unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..340f23bc251d 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -203,13 +203,14 @@ struct uffdio_copy {
__u64 dst;
__u64 src;
__u64 len;
+#define UFFDIO_COPY_MODE_DONTWAKE  ((__u64)1<<0)
/*
-* There will be a wrprotection flag later that allows to map
-* pages wrprotected on the fly. And such a flag will be
-* available if the wrprotection ioctl are implemented for the
-* range according to the uffdio_register.ioctls.
+* UFFDIO_COPY_MODE_WP will map the page write protected on
+* the fly.  UFFDIO_COPY_MODE_WP is available only if the
+* write protected ioctl is implemented for the range
+* according to the uffdio_register.ioctls.
 */
-#define UFFDIO_COPY_MODE_DONTWAKE  ((__u64)1<<0)
+#define UFFDIO_COPY_MODE_WP((__u64)1<<1)
__u64 mode;
 
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9932d5755e4c..c8e7846e9b7e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
-   struct page **pagep)
+   struct page **pagep,
+   bool wp_copy)
 {
struct mem_cgroup *memcg;
pte_t _dst_pte, *dst_pte;
@@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
goto out_release;
 
-   _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-   if (dst_vma->vm_flags & VM_WRITE)
-   _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+   _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
+   if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
+   _dst_pte = pte_mkwrite(_dst_pte);
 
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (dst_vma->vm_file) {
@@ -398,7 +399,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct 
mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
struct page **page,
-   bool zeropage)
+   bool zeropage,
+   bool wp_copy)
 {
ssize_t err;
 
@@ -415,11

[PATCH v5 07/25] userfaultfd: wp: hook userfault handler to write protection fault

From: Andrea Arcangeli 

There are several cases in which a write protection fault happens.  It
could be a write to the zero page, a swapped page, or a userfault write
protected page.  When the fault happens, there is no way to know whether
userfault write protected the page before.  Here we just blindly issue a
userfault notification for a vma with VM_UFFD_WP, regardless of whether
the app has write protected it yet.  The application should be ready to
handle such a wp fault.

v1: From: Shaohua Li 

v2: Handle the userfault in the common do_wp_page. If we get there a
pagetable is present and readonly so no need to do further processing
until we solve the userfault.

In the swapin case, always swap in as readonly.  This will cause false
positive userfaults.  We need to decide later whether to eliminate them
with a flag like soft-dirty in the swap entry (see
_PAGE_SWP_SOFT_DIRTY).

hugetlbfs wouldn't need to worry about swapouts, and tmpfs would
be handled by a swap entry bit like anonymous memory.

The main problem, with no easy solution for eliminating the false
positives, will arise if/when userfaultfd is extended to real filesystem
pagecache.  When the pagecache is freed by reclaim we can't leave the
radix tree pinned if the inode and in turn the radix tree is reclaimed
as well.

The estimation is that full accuracy and lack of false positives could
easily be provided only for anonymous memory (as long as there's no
fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous
range), tmpfs and hugetlbfs.  It's most certainly worth achieving, but
in a later incremental patch.

v3: Add hooking point for THP wrprotect faults.

CC: Shaohua Li 
Signed-off-by: Andrea Arcangeli 
[peterx: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
Reviewed-by: Mike Rapoport 
Reviewed-by: Jerome Glisse 
Signed-off-by: Peter Xu 
---
 mm/memory.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..05bcd741855b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2579,6 +2579,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
+   if (userfaultfd_wp(vma)) {
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return handle_userfault(vmf, VM_UFFD_WP);
+   }
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
@@ -3794,8 +3799,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault 
*vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
-   if (vma_is_anonymous(vmf->vma))
+   if (vma_is_anonymous(vmf->vma)) {
+   if (userfaultfd_wp(vmf->vma))
+   return handle_userfault(vmf, VM_UFFD_WP);
return do_huge_pmd_wp_page(vmf, orig_pmd);
+   }
if (vmf->vma->vm_ops->huge_fault)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
 
-- 
2.21.0

[PATCH v5 09/25] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers