[PATCH v3] fs: Improve eventpoll logging to stop indicting timerfd

2021-04-01 Thread Manish Varma
timerfd doesn't create any wakelocks, but eventpoll can.  When it does,
it names them after the underlying file descriptor, and since all
timerfd file descriptors are named "[timerfd]" (which saves memory on
systems like desktops with potentially many timerfd instances), all
wakesources created as a result of using the eventpoll-on-timerfd idiom
are called... "[timerfd]".

However, it becomes impossible to tell which "[timerfd]" wakesource is
affiliated with which process, and hence troubleshooting is difficult.

This change addresses this problem by changing the way eventpoll
wakesources are named:

1) the top-level per-process eventpoll wakesource is now named "epoll:P"
(instead of just "eventpoll"), where P is the PID of the creating
process.
2) individual per-underlying-filedescriptor eventpoll wakesources are
now named "epollitemN:P.F", where N is a unique ID token, P is the PID
of the creating process, and F is the name of the underlying file
descriptor.

Altogether, this could be split up into separate changes to eventpoll
and to timerfd (or other file descriptors).

Reported-by: kernel test robot 
Co-developed-by: Kelly Rossmoyer 
Signed-off-by: Kelly Rossmoyer 
Signed-off-by: Manish Varma 
---
 drivers/base/power/wakeup.c | 10 ++++++++--
 fs/eventpoll.c              | 10 ++++++++--
 include/linux/pm_wakeup.h   |  4 ++--
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
index 01057f640233..3628536c67a5 100644
--- a/drivers/base/power/wakeup.c
+++ b/drivers/base/power/wakeup.c
@@ -216,13 +216,19 @@ EXPORT_SYMBOL_GPL(wakeup_source_remove);
 /**
  * wakeup_source_register - Create wakeup source and add it to the list.
  * @dev: Device this wakeup source is associated with (or NULL if virtual).
- * @name: Name of the wakeup source to register.
+ * @fmt: format string for the wakeup source name
  */
 struct wakeup_source *wakeup_source_register(struct device *dev,
-const char *name)
+const char *fmt, ...)
 {
struct wakeup_source *ws;
int ret;
+   char name[128];
+   va_list args;
+
+   va_start(args, fmt);
+   vsnprintf(name, sizeof(name), fmt, args);
+   va_end(args);
 
ws = wakeup_source_create(name);
if (ws) {
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7df8c0fa462b..7c35987a8887 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -312,6 +312,7 @@ struct ctl_table epoll_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+static atomic_t wakesource_create_id  = ATOMIC_INIT(0);
 static const struct file_operations eventpoll_fops;
 
 static inline int is_file_epoll(struct file *f)
@@ -1451,15 +1452,20 @@ static int ep_create_wakeup_source(struct epitem *epi)
 {
struct name_snapshot n;
struct wakeup_source *ws;
+   pid_t task_pid;
+   int id;
+
+   task_pid = task_pid_nr(current);
 
if (!epi->ep->ws) {
-   epi->ep->ws = wakeup_source_register(NULL, "eventpoll");
+   epi->ep->ws = wakeup_source_register(NULL, "epoll:%d", task_pid);
if (!epi->ep->ws)
return -ENOMEM;
}
 
+   id = atomic_inc_return(&wakesource_create_id);
 	take_dentry_name_snapshot(&n, epi->ffd.file->f_path.dentry);
-   ws = wakeup_source_register(NULL, n.name.name);
+   ws = wakeup_source_register(NULL, "epollitem%d:%d.%s", id, task_pid,
+			       n.name.name);
 	release_dentry_name_snapshot(&n);
 
if (!ws)
diff --git a/include/linux/pm_wakeup.h b/include/linux/pm_wakeup.h
index aa3da6611533..cb91c84f6f08 100644
--- a/include/linux/pm_wakeup.h
+++ b/include/linux/pm_wakeup.h
@@ -95,7 +95,7 @@ extern void wakeup_source_destroy(struct wakeup_source *ws);
 extern void wakeup_source_add(struct wakeup_source *ws);
 extern void wakeup_source_remove(struct wakeup_source *ws);
 extern struct wakeup_source *wakeup_source_register(struct device *dev,
-   const char *name);
+   const char *fmt, ...);
 extern void wakeup_source_unregister(struct wakeup_source *ws);
 extern int wakeup_sources_read_lock(void);
 extern void wakeup_sources_read_unlock(int idx);
@@ -137,7 +137,7 @@ static inline void wakeup_source_add(struct wakeup_source *ws) {}
 static inline void wakeup_source_remove(struct wakeup_source *ws) {}
 
 static inline struct wakeup_source *wakeup_source_register(struct device *dev,
-  const char *name)
+  const char *fmt, ...)
 {
return NULL;
 }
-- 
2.31.0.208.g409f899ff0-goog



RE: [PATCH v3 1/2] PCI: xilinx-nwl: Enable coherent PCIe DMA traffic using CCI

2021-04-01 Thread Bharat Kumar Gogada
Hi Lorenzo,

Any inputs on this ?

Regards,
Bharat

> -Original Message-
> From: Bharat Kumar Gogada
> Sent: Tuesday, March 23, 2021 4:48 PM
> To: Bharat Kumar Gogada ; linux-
> p...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: bhelg...@google.com; Lorenzo Pieralisi 
> Subject: RE: [PATCH v3 1/2] PCI: xilinx-nwl: Enable coherent PCIe DMA traffic
> using CCI
> 
> Ping.
> 
> > -Original Message-
> > From: Bharat Kumar Gogada
> > Sent: Monday, March 15, 2021 11:43 AM
> > To: Bharat Kumar Gogada ; linux-
> > p...@vger.kernel.org; linux-kernel@vger.kernel.org
> > Cc: bhelg...@google.com
> > Subject: RE: [PATCH v3 1/2] PCI: xilinx-nwl: Enable coherent PCIe DMA
> > traffic using CCI
> >
> > Ping.
> >
> > > -Original Message-
> > > From: Bharat Kumar Gogada 
> > > Sent: Monday, February 22, 2021 2:18 PM
> > > To: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org
> > > Cc: bhelg...@google.com; Bharat Kumar Gogada 
> > > Subject: [PATCH v3 1/2] PCI: xilinx-nwl: Enable coherent PCIe DMA
> > > traffic using CCI
> > >
> > > Add support for routing PCIe DMA traffic coherently when Cache
> > > Coherent Interconnect (CCI) is enabled in the system.
> > > The "dma-coherent" property is used to determine if CCI is enabled or
> not.
> > > Refer to https://developer.arm.com/documentation/ddi0470/k/preface
> > > for the CCI specification.
> > >
> > > Signed-off-by: Bharat Kumar Gogada 
> > > ---
> > >  drivers/pci/controller/pcie-xilinx-nwl.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/pci/controller/pcie-xilinx-nwl.c
> > > b/drivers/pci/controller/pcie-xilinx-nwl.c
> > > index 07e36661bbc2..8689311c5ef6 100644
> > > --- a/drivers/pci/controller/pcie-xilinx-nwl.c
> > > +++ b/drivers/pci/controller/pcie-xilinx-nwl.c
> > > @@ -26,6 +26,7 @@
> > >
> > >  /* Bridge core config registers */
> > >  #define BRCFG_PCIE_RX0   0x0000
> > > +#define BRCFG_PCIE_RX1   0x0004
> > >  #define BRCFG_INTERRUPT  0x0010
> > >  #define BRCFG_PCIE_RX_MSG_FILTER 0x0020
> > >
> > > @@ -128,6 +129,7 @@
> > >  #define NWL_ECAM_VALUE_DEFAULT   12
> > >
> > >  #define CFG_DMA_REG_BAR  GENMASK(2, 0)
> > > +#define CFG_PCIE_CACHE   GENMASK(7, 0)
> > >
> > >  #define INT_PCI_MSI_NR   (2 * 32)
> > >
> > > @@ -675,6 +677,11 @@ static int nwl_pcie_bridge_init(struct nwl_pcie
> > > *pcie)
> > >   nwl_bridge_writel(pcie, CFG_ENABLE_MSG_FILTER_MASK,
> > > BRCFG_PCIE_RX_MSG_FILTER);
> > >
> > > + /* This routes the PCIe DMA traffic to go through CCI path */
> > > + if (of_dma_is_coherent(dev->of_node))
> > > + nwl_bridge_writel(pcie, nwl_bridge_readl(pcie,
> > > BRCFG_PCIE_RX1) |
> > > +   CFG_PCIE_CACHE, BRCFG_PCIE_RX1);
> > > +
> > >   err = nwl_wait_for_link(pcie);
> > >   if (err)
> > >   return err;
> > > --
> > > 2.17.1



Re: [PATCH] drbd: Fix a use after free in get_initial_state

2021-04-01 Thread kernel test robot
Hi Lv,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on block/for-next]
[also build test WARNING on linux/master linus/master v5.12-rc5 next-20210401]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Lv-Yunlong/drbd-Fix-a-use-after-free-in-get_initial_state/20210402-015401
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git 
for-next
config: x86_64-randconfig-s021-20210401 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-279-g6d5d9b42-dirty
# 
https://github.com/0day-ci/linux/commit/af3f55d6c8730c5c1ce31fda165712091584adb0
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Lv-Yunlong/drbd-Fix-a-use-after-free-in-get_initial_state/20210402-015401
git checkout af3f55d6c8730c5c1ce31fda165712091584adb0
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 


sparse warnings: (new ones prefixed by >>)
>> drivers/block/drbd/drbd_nl.c:4957:1: sparse: sparse: unused label 'out'
   drivers/block/drbd/drbd_nl.c: note: in included file:
   include/linux/genl_magic_func.h:212:12: sparse: sparse: symbol 
'drbd_genl_cmd_to_str' was not declared. Should it be static?
   drivers/block/drbd/drbd_nl.c:454:33: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:454:33: sparse:struct disk_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:454:33: sparse:struct disk_conf *
   drivers/block/drbd/drbd_nl.c:691:38: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:691:38: sparse:struct net_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:691:38: sparse:struct net_conf *
   drivers/block/drbd/drbd_nl.c:793:40: sparse: sparse: mixing different enum 
types:
   drivers/block/drbd/drbd_nl.c:793:40: sparse:int enum drbd_state_rv
   drivers/block/drbd/drbd_nl.c:793:40: sparse:unsigned int enum 
drbd_ret_code
   drivers/block/drbd/drbd_nl.c:795:40: sparse: sparse: mixing different enum 
types:
   drivers/block/drbd/drbd_nl.c:795:40: sparse:int enum drbd_state_rv
   drivers/block/drbd/drbd_nl.c:795:40: sparse:unsigned int enum 
drbd_ret_code
   drivers/block/drbd/drbd_nl.c:980:18: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:980:18: sparse:struct disk_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:980:18: sparse:struct disk_conf *
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1287:41: sparse: sparse: cast to restricted 
__be32
   drivers/block/drbd/drbd_nl.c:1347:22: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:1347:22: sparse:struct disk_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:1347:22: sparse:struct disk_conf *
   drivers/block/drbd/drbd_nl.c:1639:17: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:1639:17: sparse:struct disk_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:1639:17: sparse:struct disk_conf *
   drivers/block/drbd/drbd_nl.c:1649:17: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:1649:17: sparse:struct fifo_buffer 
[noderef] __rcu *
   drivers/block/drbd/drbd_nl.c:1649:17: sparse:struct fifo_buffer *
   drivers/block/drbd/drbd_nl.c:1872:14: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:1872:14: sparse:struct net_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:1872:14: sparse:struct net_conf *
   drivers/block/drbd/drbd_nl.c:2130:39: sparse: sparse: incompatible types in 
comparison expression (different address spaces):
   drivers/block/drbd/drbd_nl.c:2130:39: sparse:struct disk_conf [noderef] 
__rcu *
   drivers/block/drbd/drbd_nl.c:2130:39: sparse:struct disk_conf *
   drivers/block/drbd/drbd_

Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu
On 4/2/21 11:41 AM, Longpeng (Mike, Cloud Infrastructure Service Product 
Dept.) wrote:

Hi Baolu,

在 2021/4/2 11:06, Lu Baolu 写道:

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

The translation caches may preserve obsolete data when the
mapping size is changed, suppose the following sequence which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
    while (1) {
     (a)    DMA MAP   0,0xa
     (b)    DMA UNMAP 0,0xa
     (c)    DMA MAP   0,0xc000
   * DMA read of IOVA 0 may fail here (Not present)
   * if the problem occurs.
     (d)    DMA UNMAP 0,0xc000
    }

The page table(only focus on IOVA 0) after (a) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x1a30a72003  entry:0x89b39cacb000
  PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x1a30a72003  entry:0x89b39cacb000
  PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be
flushed even if the iommu driver flushes the cache on unmap,
so the obsolete data may be preserved in the cache, which
would cause a wrong translation in the end.

However, we can see the PDE entry finally switches to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
   dma_pte_free_pagetable
     Set the PDE entry to ZERO
   Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid preserving the obsolete info.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating
superpage")
Cc:  # v3.0+
Link:
https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/

Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
   drivers/iommu/intel/iommu.c | 15 +--
   1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct
dmar_domain *domain,
    * removed to make room for superpage(s).
    * We're adding new large pages, so make sure
    * we don't remove their parent tables.
+ *
+ * We also need to flush the iotlb before creating
+ * a superpage to ensure it does not preserve any
+ * obsolete info.
    */
-    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-   largepage_lvl + 1);
+    if (dma_pte_present(pte)) {
+    int i;
+
+    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+   largepage_lvl + 1);
+    for_each_domain_iommu(i, domain)
+    iommu_flush_iotlb_psi(g_iommus[i], domain,
+  iov_pfn, nr_pages, 0, 0);


Thanks for patch!

How about making the flushed page size accurate? For example,

@@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
 					       largepage_lvl + 1);
 			for_each_domain_iommu(i, domain)
-				iommu_flush_iotlb_psi(g_iommus[i], domain,
-						      iov_pfn, nr_pages, 0, 0);
+				iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
+						      ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);


Yes, make sense.

Maybe another alternative is 'end_pfn - iov_pfn + 1', which is readable because we
free the page table with (iov_pfn, end_pfn) above. Which one do you prefer?


Yours looks better.

By the way, if you are willing to prepare a v2, please make sure to add
Joerg (IOMMU subsystem maintainer) to the list.

Best regards,
baolu


[PATCH] thermal/drivers/cpuidle_cooling: Make sure that idle_duration is larger than residency

2021-04-01 Thread zhuguangqing83
From: Guangqing Zhu 

The injected idle duration should be greater than the idle state min
residency, otherwise we end up consuming more energy and potentially invert
the mitigation effect.

In function __cpuidle_cooling_register(), if
of_property_read_u32(np, "exit-latency-us", &latency_us) fails, then
maybe we should not use latency_us. In this case, a zero latency_us for
forced_idle_latency_limit_ns is better than UINT_MAX: it means the
governors are used in the usual way.

Signed-off-by: Guangqing Zhu 
---
 drivers/powercap/idle_inject.c| 1 -
 drivers/thermal/cpuidle_cooling.c | 8 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
index 6e1a0043c411..d76eef1e9387 100644
--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -309,7 +309,6 @@ struct idle_inject_device *idle_inject_register(struct 
cpumask *cpumask)
cpumask_copy(to_cpumask(ii_dev->cpumask), cpumask);
	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
ii_dev->timer.function = idle_inject_timer_fn;
-   ii_dev->latency_us = UINT_MAX;
 
for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
 
diff --git a/drivers/thermal/cpuidle_cooling.c 
b/drivers/thermal/cpuidle_cooling.c
index 7ecab4b16b29..de770eb5b2ba 100644
--- a/drivers/thermal/cpuidle_cooling.c
+++ b/drivers/thermal/cpuidle_cooling.c
@@ -175,7 +175,8 @@ static int __cpuidle_cooling_register(struct device_node 
*np,
struct cpuidle_cooling_device *idle_cdev;
struct thermal_cooling_device *cdev;
unsigned int idle_duration_us = TICK_USEC;
-   unsigned int latency_us = UINT_MAX;
+   unsigned int latency_us = 0;
+   unsigned int residency_us = UINT_MAX;
char dev_name[THERMAL_NAME_LENGTH];
int id, ret;
 
@@ -199,6 +200,11 @@ static int __cpuidle_cooling_register(struct device_node 
*np,
 
	of_property_read_u32(np, "duration-us", &idle_duration_us);
	of_property_read_u32(np, "exit-latency-us", &latency_us);
+	of_property_read_u32(np, "min-residency-us", &residency_us);
+   if (idle_duration_us <= residency_us) {
+   ret = -EINVAL;
+   goto out_unregister;
+   }
 
idle_inject_set_duration(ii_dev, TICK_USEC, idle_duration_us);
idle_inject_set_latency(ii_dev, latency_us);
-- 
2.17.1



Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct 
dmar_domain *domain,
 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+* We also need to flush the iotlb before creating
+* a superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {


The dma_pte_free_pagetable() clears a batch of PTEs. So checking the current
PTE is insufficient. How about removing this check and always performing
cache invalidation?


+			int i;
+
+			dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+					       largepage_lvl + 1);
+			for_each_domain_iommu(i, domain)
+				iommu_flush_iotlb_psi(g_iommus[i], domain,
+						      iov_pfn, nr_pages, 0, 0);
+


Best regards,
baolu


Re: [PATCH] riscv: Bump COMMAND_LINE_SIZE value to 1024

2021-04-01 Thread Palmer Dabbelt

On Tue, 30 Mar 2021 13:31:45 PDT (-0700), ma...@orcam.me.uk wrote:

On Mon, 29 Mar 2021, Palmer Dabbelt wrote:


> --- /dev/null
> +++ b/arch/riscv/include/uapi/asm/setup.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_ASM_RISCV_SETUP_H
> +#define _UAPI_ASM_RISCV_SETUP_H
> +
> +#define COMMAND_LINE_SIZE 1024
> +
> +#endif /* _UAPI_ASM_RISCV_SETUP_H */

I put this on fixes, but it seems like this should really be a Kconfig
entry.  Either way, ours was quite a bit smaller than most architectures and
it's great that syzbot has started to find bugs, so I'd rather get this in
sooner.


 This macro is exported as a part of the user API so it must not depend on
Kconfig.  Also changing it (rather than say adding COMMAND_LINE_SIZE_V2 or
switching to an entirely new data object that has its dimension set in a
different way) requires careful evaluation as external binaries have and
will have the value it expands to compiled in, so it's a part of the ABI
too.


Thanks, I didn't realize this was part of the user ABI.  In that case we 
really can't change it, so we'll have to sort out some other way to fix 
whatever is going on.


I've dropped this from fixes.


Re: linux-next: manual merge of the risc-v tree with Linus' tree

2021-04-01 Thread Palmer Dabbelt

On Tue, 30 Mar 2021 15:40:34 PDT (-0700), Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the risc-v tree got a conflict in:

  arch/riscv/mm/kasan_init.c

between commits:

  f3773dd031de ("riscv: Ensure page table writes are flushed when initializing KASAN 
vmalloc")
  78947bdfd752 ("RISC-V: kasan: Declare kasan_shallow_populate() static")

from Linus' tree and commit:

  2da073c19641 ("riscv: Cleanup KASAN_VMALLOC support")

from the risc-v tree.

I fixed it up (I think - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.


They're my own trees ;)

I'm not so great at reading merge diffs, but the right fix here is to 
have the local_flush_tlb_all() after the call to 
kasan_shallow_populate_pgd(), just as there is one after 
kasan_populate_pgd().  My merge diff looks like this


diff --cc arch/riscv/mm/kasan_init.c
index 2c39f0386673,4f85c6d0ddf8..ec0029097251
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@@ -162,8 -159,36 +162,10 @@@ static void __init kasan_shallow_popula
 {
   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
   unsigned long vend = PAGE_ALIGN((unsigned long)end);
-  unsigned long pfn;
-  int index;
-  void *p;
-  pud_t *pud_dir, *pud_k;
-  pgd_t *pgd_dir, *pgd_k;
-  p4d_t *p4d_dir, *p4d_k;
-
-  while (vaddr < vend) {
-  index = pgd_index(vaddr);
-  pfn = csr_read(CSR_SATP) & SATP_PPN;
-  pgd_dir = (pgd_t *)pfn_to_virt(pfn) + index;
-  pgd_k = init_mm.pgd + index;
-  pgd_dir = pgd_offset_k(vaddr);
-  set_pgd(pgd_dir, *pgd_k);
-
-  p4d_dir = p4d_offset(pgd_dir, vaddr);
-  p4d_k  = p4d_offset(pgd_k, vaddr);
-
-  vaddr = (vaddr + PUD_SIZE) & PUD_MASK;
-  pud_dir = pud_offset(p4d_dir, vaddr);
-  pud_k = pud_offset(p4d_k, vaddr);
-
-  if (pud_present(*pud_dir)) {
-  p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
-  pud_populate(&init_mm, pud_dir, p);
-  }
-  vaddr += PAGE_SIZE;
-  }
+
+  kasan_shallow_populate_pgd(vaddr, vend);
+
+   local_flush_tlb_all();
 }

 void __init kasan_init(void)

which doesn't include the diff to kasan_shallow_populate_pgd().  Not 
sure if that's just because my diff is in the other direction, though.  
The expected result is that kasan_shallow_populate_pgd() exists both pre 
and post merge.




--
Cheers,
Stephen Rothwell

diff --cc arch/riscv/mm/kasan_init.c
index 4f85c6d0ddf8,2c39f0386673..
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@@ -153,44 -141,31 +141,33 @@@ static void __init kasan_populate(void 
  
  	local_flush_tlb_all();

memset(start, KASAN_SHADOW_INIT, end - start);
  }
  
+ static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)

+ {
+   unsigned long next;
+   void *p;
+   pgd_t *pgd_k = pgd_offset_k(vaddr);
+ 
+ 	do {

+   next = pgd_addr_end(vaddr, end);
+   if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
+   p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
+   }
+   } while (pgd_k++, vaddr = next, vaddr != end);
+ }
+ 
  static void __init kasan_shallow_populate(void *start, void *end)

  {
unsigned long vaddr = (unsigned long)start & PAGE_MASK;
unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long pfn;
-   int index;
-   void *p;
-   pud_t *pud_dir, *pud_k;
-   pgd_t *pgd_dir, *pgd_k;
-   p4d_t *p4d_dir, *p4d_k;
- 
- 	while (vaddr < vend) {

-   index = pgd_index(vaddr);
-   pfn = csr_read(CSR_SATP) & SATP_PPN;
-   pgd_dir = (pgd_t *)pfn_to_virt(pfn) + index;
-   pgd_k = init_mm.pgd + index;
-   pgd_dir = pgd_offset_k(vaddr);
-   set_pgd(pgd_dir, *pgd_k);
- 
- 		p4d_dir = p4d_offset(pgd_dir, vaddr);

-   p4d_k  = p4d_offset(pgd_k, vaddr);
- 
- 		vaddr = (vaddr + PUD_SIZE) & PUD_MASK;

-   pud_dir = pud_offset(p4d_dir, vaddr);
-   pud_k = pud_offset(p4d_k, vaddr);
- 
- 		if (pud_present(*pud_dir)) {

-   p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
-   pud_populate(&init_mm, pud_dir, p);
-   }
-   vaddr += PAGE_SIZE;
-   }
+ 
+ 	kasan_shallow_populate_pgd(vaddr, vend);

 +
 +  local_flush_tlb_all();
  }
  
  void __init kasan_init(void)

  {
phys_addr_t _start, _end;


Re: [PATCH v5] iommu/tegra-smmu: Add pagetable mappings to debugfs

2021-04-01 Thread Nicolin Chen
On Tue, Mar 16, 2021 at 12:16:43PM +0100, Thierry Reding wrote:

> > +struct tegra_smmu_group_debug {
> > +   const struct tegra_smmu_swgroup *group;
> > +   void *priv;
> 
> This always stores the address space, so why not make this:
> 
>   struct tegra_smmu_as *as;
> 
> ? While at it, perhaps throw in a const to make sure we don't modify
> this structure in the debugfs code.

OK. I will try to change that.

> > @@ -334,7 +350,7 @@ static void tegra_smmu_domain_free(struct iommu_domain 
> > *domain)
> >  }
> >  
> >  static const struct tegra_smmu_swgroup *
> > -tegra_smmu_find_swgroup(struct tegra_smmu *smmu, unsigned int swgroup)
> > +tegra_smmu_find_swgroup(struct tegra_smmu *smmu, unsigned int swgroup, int 
> > *index)
> >  {
> > const struct tegra_smmu_swgroup *group = NULL;
> > unsigned int i;
> > @@ -342,6 +358,8 @@ tegra_smmu_find_swgroup(struct tegra_smmu *smmu, 
> > unsigned int swgroup)
> > for (i = 0; i < smmu->soc->num_swgroups; i++) {
> > if (smmu->soc->swgroups[i].swgroup == swgroup) {
> > group = &smmu->soc->swgroups[i];
> > +   if (index)
> > +   *index = i;
> 
> This doesn't look like the right place for this. And this also makes
> things hard to follow because it passes out-of-band data in the index
> parameter.
> 
> I'm thinking that this could benefit from a bit of refactoring where
> we could for example embed struct tegra_smmu_group_debug into struct
> tegra_smmu_group and then reference that when necessary instead of
> carrying all that data in an orthogonal array. That should also make
> it easier to track this.
> 
> Come to think of it, everything that's currently in your new struct
> tegra_smmu_group_debug would be useful in struct tegra_smmu_group,
> irrespective of debugfs support.

Will try to embed it or see what I can do following the suggestion.

> > +static int tegra_smmu_mappings_show(struct seq_file *s, void *data)
> > +{

> > +   seq_printf(s, "\nSWGROUP: %s\nASID: %d\nreg: 0x%x\n",
> > +  group->name, as->id, group->reg);
> 
> Is group->reg really that useful here?

Can drop it.

> > +
> > +   smmu_writel(smmu, as->id & 0x7f, SMMU_PTB_ASID);
> > +   ptb_reg = smmu_readl(smmu, SMMU_PTB_DATA);
> > +
> > +   seq_printf(s, "PTB_ASID: 0x%x\nas->pd_dma: %pad\n",
> > +  ptb_reg, &as->pd_dma);
> 
> This looks somewhat redundant because as->pd_dma is already part of the
> PTB_ASID register value. Instead, perhaps decode the upper bits of that
> register and simply output as->pd_dma so that we don't duplicate the base
> address of the page table?

That's a good idea. Will change that too.


Re: [PATCH v2 8/9] riscv: module: Create module allocations without exec permissions

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:04 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> The core code manages the executable permissions of code regions of
> modules explicitly, so it is not necessary to create the module vmalloc
> regions with RWX permissions. Create them with RW- permissions instead.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/kernel/module.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index 104fba889cf7..e89367bba7c9 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -407,14 +407,20 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
> *strtab,
> return 0;
>  }
>
> -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> +#ifdef CONFIG_MMU
> +
> +#ifdef CONFIG_64BIT
>  #define VMALLOC_MODULE_START \
>  max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> +#else
> +#define VMALLOC_MODULE_START   VMALLOC_START
> +#endif
> +
>  void *module_alloc(unsigned long size)
>  {
> return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> VMALLOC_END, GFP_KERNEL,
> -   PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> +   PAGE_KERNEL, 0, NUMA_NO_NODE,
> __builtin_return_address(0));
>  }
>  #endif
> --
> 2.31.0
>
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


Re: [PATCH v2 5/9] riscv: kprobes: Implement alloc_insn_page()

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:02 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> Allocate a PAGE_KERNEL_READ_EXEC (read-only, executable) page for the
> kprobes insn page. This is to prepare for STRICT_MODULE_RWX.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/kernel/probes/kprobes.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/riscv/kernel/probes/kprobes.c 
> b/arch/riscv/kernel/probes/kprobes.c
> index 7e2c78e2ca6b..8c1f7a30aeed 100644
> --- a/arch/riscv/kernel/probes/kprobes.c
> +++ b/arch/riscv/kernel/probes/kprobes.c
> @@ -84,6 +84,14 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
> return 0;
>  }
>
> +void *alloc_insn_page(void)
> +{
> +   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> +GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
> +VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> +__builtin_return_address(0));
> +}
> +
>  /* install breakpoint in text */
>  void __kprobes arch_arm_kprobe(struct kprobe *p)
>  {
> --
> 2.31.0
>
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


Re: [PATCH v2 9/9] riscv: Set ARCH_HAS_STRICT_MODULE_RWX if MMU

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:05 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> Now we can set ARCH_HAS_STRICT_MODULE_RWX for MMU riscv platforms, which
> is good from a security perspective.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 87d7b52f278f..9716be3674a2 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -28,6 +28,7 @@ config RISCV
> select ARCH_HAS_SET_DIRECT_MAP
> select ARCH_HAS_SET_MEMORY
> select ARCH_HAS_STRICT_KERNEL_RWX if MMU
> +   select ARCH_HAS_STRICT_MODULE_RWX if MMU
> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
> select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
> --
> 2.31.0
>
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


Re: [PATCH v2 4/9] riscv: Constify sbi_ipi_ops

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:02 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> Constify the sbi_ipi_ops so that it will be placed in the .rodata
> section. This will cause attempts to modify it to fail when strict
> page permissions are in place.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/include/asm/smp.h | 4 ++--
>  arch/riscv/kernel/sbi.c  | 2 +-
>  arch/riscv/kernel/smp.c  | 4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/riscv/include/asm/smp.h b/arch/riscv/include/asm/smp.h
> index df1f7c4cd433..a7d2811f3536 100644
> --- a/arch/riscv/include/asm/smp.h
> +++ b/arch/riscv/include/asm/smp.h
> @@ -46,7 +46,7 @@ int riscv_hartid_to_cpuid(int hartid);
>  void riscv_cpuid_to_hartid_mask(const struct cpumask *in, struct cpumask 
> *out);
>
>  /* Set custom IPI operations */
> -void riscv_set_ipi_ops(struct riscv_ipi_ops *ops);
> +void riscv_set_ipi_ops(const struct riscv_ipi_ops *ops);
>
>  /* Clear IPI for current CPU */
>  void riscv_clear_ipi(void);
> @@ -92,7 +92,7 @@ static inline void riscv_cpuid_to_hartid_mask(const struct 
> cpumask *in,
> cpumask_set_cpu(boot_cpu_hartid, out);
>  }
>
> -static inline void riscv_set_ipi_ops(struct riscv_ipi_ops *ops)
> +static inline void riscv_set_ipi_ops(const struct riscv_ipi_ops *ops)
>  {
>  }
>
> diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
> index cbd94a72eaa7..cb848e80865e 100644
> --- a/arch/riscv/kernel/sbi.c
> +++ b/arch/riscv/kernel/sbi.c
> @@ -556,7 +556,7 @@ static void sbi_send_cpumask_ipi(const struct cpumask 
> *target)
> sbi_send_ipi(cpumask_bits(&hartid_mask));
>  }
>
> -static struct riscv_ipi_ops sbi_ipi_ops = {
> +static const struct riscv_ipi_ops sbi_ipi_ops = {
> .ipi_inject = sbi_send_cpumask_ipi
>  };
>
> diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
> index 504284d49135..e035124f06dc 100644
> --- a/arch/riscv/kernel/smp.c
> +++ b/arch/riscv/kernel/smp.c
> @@ -85,9 +85,9 @@ static void ipi_stop(void)
> wait_for_interrupt();
>  }
>
> -static struct riscv_ipi_ops *ipi_ops __ro_after_init;
> +static const struct riscv_ipi_ops *ipi_ops __ro_after_init;
>
> -void riscv_set_ipi_ops(struct riscv_ipi_ops *ops)
> +void riscv_set_ipi_ops(const struct riscv_ipi_ops *ops)
>  {
> ipi_ops = ops;
>  }
> --
> 2.31.0
>
>
>


[tip:x86/core] BUILD SUCCESS f31390437ce984118215169d75570e365457ec23

2021-04-01 Thread kernel test robot
 ecovec24_defconfig
mips  maltaaprp_defconfig
powerpc   eiger_defconfig
ia64 bigsur_defconfig
xtensa   common_defconfig
arm  ixp4xx_defconfig
mips tb0219_defconfig
mipsmalta_qemu_32r6_defconfig
arm   corgi_defconfig
m68k   m5208evb_defconfig
mips   capcella_defconfig
powerpc mpc837x_rdb_defconfig
ia64 allyesconfig
sh   se7750_defconfig
riscvnommu_k210_defconfig
arm lpc18xx_defconfig
arm  moxart_defconfig
s390 allyesconfig
um   allmodconfig
arm palmz72_defconfig
sh   sh2007_defconfig
powerpc  iss476-smp_defconfig
arm  jornada720_defconfig
ia64 allmodconfig
ia64defconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
parisc   allyesconfig
i386defconfig
sparcallyesconfig
sparc   defconfig
mips allyesconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a006-20210401
i386 randconfig-a003-20210401
i386 randconfig-a001-20210401
i386 randconfig-a004-20210401
i386 randconfig-a002-20210401
i386 randconfig-a005-20210401
i386 randconfig-a006-20210402
i386 randconfig-a003-20210402
i386 randconfig-a001-20210402
i386 randconfig-a004-20210402
i386 randconfig-a005-20210402
i386 randconfig-a002-20210402
x86_64   randconfig-a014-20210401
x86_64   randconfig-a015-20210401
x86_64   randconfig-a011-20210401
x86_64   randconfig-a013-20210401
x86_64   randconfig-a012-20210401
x86_64   randconfig-a016-20210401
i386 randconfig-a014-20210401
i386 randconfig-a011-20210401
i386 randconfig-a016-20210401
i386 randconfig-a012-20210401
i386 randconfig-a013-20210401
i386 randconfig-a015-20210401
riscvnommu_virt_defconfig
riscv allnoconfig
umallnoconfig
um   allyesconfig
um  defconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20210401
x86_64   randconfig-a005-20210401
x86_64   randconfig-a003-20210401
x86_64   randconfig-a001-20210401
x86_64   randconfig-a002-20210401
x86_64   randconfig-a006-20210401

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v2 3/9] riscv: Constify sys_call_table

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:01 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> Constify the sys_call_table so that it will be placed in the .rodata
> section. This will cause attempts to modify the table to fail when
> strict page permissions are in place.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/include/asm/syscall.h  | 2 +-
>  arch/riscv/kernel/syscall_table.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/syscall.h 
> b/arch/riscv/include/asm/syscall.h
> index 49350c8bd7b0..b933b1583c9f 100644
> --- a/arch/riscv/include/asm/syscall.h
> +++ b/arch/riscv/include/asm/syscall.h
> @@ -15,7 +15,7 @@
>  #include 
>
>  /* The array of function pointers for syscalls. */
> -extern void *sys_call_table[];
> +extern void * const sys_call_table[];
>
>  /*
>   * Only the low 32 bits of orig_r0 are meaningful, so we return int.
> diff --git a/arch/riscv/kernel/syscall_table.c 
> b/arch/riscv/kernel/syscall_table.c
> index f1ead9df96ca..a63c667c27b3 100644
> --- a/arch/riscv/kernel/syscall_table.c
> +++ b/arch/riscv/kernel/syscall_table.c
> @@ -13,7 +13,7 @@
>  #undef __SYSCALL
>  #define __SYSCALL(nr, call)[nr] = (call),
>
> -void *sys_call_table[__NR_syscalls] = {
> +void * const sys_call_table[__NR_syscalls] = {
> [0 ... __NR_syscalls - 1] = sys_ni_syscall,
>  #include 
>  };
> --
> 2.31.0
>
>
>


Re: [PATCH v2 2/9] riscv: Mark some global variables __ro_after_init

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:01 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> All of these are never modified after init, so they can be
> __ro_after_init.
>
> Signed-off-by: Jisheng Zhang 

Looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup

> ---
>  arch/riscv/kernel/sbi.c  | 8 
>  arch/riscv/kernel/smp.c  | 4 ++--
>  arch/riscv/kernel/time.c | 2 +-
>  arch/riscv/kernel/vdso.c | 4 ++--
>  arch/riscv/mm/init.c | 6 +++---
>  5 files changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
> index d3bf756321a5..cbd94a72eaa7 100644
> --- a/arch/riscv/kernel/sbi.c
> +++ b/arch/riscv/kernel/sbi.c
> @@ -11,14 +11,14 @@
>  #include 
>
>  /* default SBI version is 0.1 */
> -unsigned long sbi_spec_version = SBI_SPEC_VERSION_DEFAULT;
> +unsigned long sbi_spec_version __ro_after_init = SBI_SPEC_VERSION_DEFAULT;
>  EXPORT_SYMBOL(sbi_spec_version);
>
> -static void (*__sbi_set_timer)(uint64_t stime);
> -static int (*__sbi_send_ipi)(const unsigned long *hart_mask);
> +static void (*__sbi_set_timer)(uint64_t stime) __ro_after_init;
> +static int (*__sbi_send_ipi)(const unsigned long *hart_mask) __ro_after_init;
>  static int (*__sbi_rfence)(int fid, const unsigned long *hart_mask,
>unsigned long start, unsigned long size,
> -  unsigned long arg4, unsigned long arg5);
> +  unsigned long arg4, unsigned long arg5) 
> __ro_after_init;
>
>  struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0,
> unsigned long arg1, unsigned long arg2,
> diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
> index ea028d9e0d24..504284d49135 100644
> --- a/arch/riscv/kernel/smp.c
> +++ b/arch/riscv/kernel/smp.c
> @@ -30,7 +30,7 @@ enum ipi_message_type {
> IPI_MAX
>  };
>
> -unsigned long __cpuid_to_hartid_map[NR_CPUS] = {
> +unsigned long __cpuid_to_hartid_map[NR_CPUS] __ro_after_init = {
> [0 ... NR_CPUS-1] = INVALID_HARTID
>  };
>
> @@ -85,7 +85,7 @@ static void ipi_stop(void)
> wait_for_interrupt();
>  }
>
> -static struct riscv_ipi_ops *ipi_ops;
> +static struct riscv_ipi_ops *ipi_ops __ro_after_init;
>
>  void riscv_set_ipi_ops(struct riscv_ipi_ops *ops)
>  {
> diff --git a/arch/riscv/kernel/time.c b/arch/riscv/kernel/time.c
> index 1b432264f7ef..8217b0f67c6c 100644
> --- a/arch/riscv/kernel/time.c
> +++ b/arch/riscv/kernel/time.c
> @@ -11,7 +11,7 @@
>  #include 
>  #include 
>
> -unsigned long riscv_timebase;
> +unsigned long riscv_timebase __ro_after_init;
>  EXPORT_SYMBOL_GPL(riscv_timebase);
>
>  void __init time_init(void)
> diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
> index 3f1d35e7c98a..25a3b8849599 100644
> --- a/arch/riscv/kernel/vdso.c
> +++ b/arch/riscv/kernel/vdso.c
> @@ -20,8 +20,8 @@
>
>  extern char vdso_start[], vdso_end[];
>
> -static unsigned int vdso_pages;
> -static struct page **vdso_pagelist;
> +static unsigned int vdso_pages __ro_after_init;
> +static struct page **vdso_pagelist __ro_after_init;
>
>  /*
>   * The vDSO data page.
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 76bf2de8aa59..719ec72ef069 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -149,11 +149,11 @@ void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> -static struct pt_alloc_ops pt_ops;
> +static struct pt_alloc_ops pt_ops __ro_after_init;
>
> -unsigned long va_pa_offset;
> +unsigned long va_pa_offset __ro_after_init;
>  EXPORT_SYMBOL(va_pa_offset);
> -unsigned long pfn_base;
> +unsigned long pfn_base __ro_after_init;
>  EXPORT_SYMBOL(pfn_base);
>
>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> --
> 2.31.0
>
>
>


Re: [PATCH 0/4] mm/page_reporting: Some knobs and fixes

2021-04-01 Thread Xunlei Pang
On 3/26/21 5:44 PM, Xunlei Pang wrote:
> Add the following knobs in PATCH 1~3:
>  /sys/kernel/mm/page_reporting/reported_kbytes
>  /sys/kernel/mm/page_reporting/refault_kbytes
>  /sys/kernel/mm/page_reporting/reporting_factor
> 
> Fix unexpected user OOM in PATCH 4.
> 
> Xunlei Pang (4):
>   mm/page_reporting: Introduce free page reported counters
>   mm/page_reporting: Introduce free page reporting factor
>   mm/page_reporting: Introduce "page_reporting_factor=" boot parameter
>   mm/page_reporting: Fix possible user allocation failure
> 
>  Documentation/admin-guide/kernel-parameters.txt |   3 +
>  include/linux/mmzone.h  |   3 +
>  mm/page_alloc.c |   6 +-
>  mm/page_reporting.c | 268 
> ++--
>  4 files changed, 260 insertions(+), 20 deletions(-)
> 

Hi guys,

It looks like "Alexander Duyck " was not
available, so I've Cc'ed more people. Any comments?

Thanks!


Re: [PATCH v2 1/9] riscv: add __init section marker to some functions

2021-04-01 Thread Anup Patel
On Wed, Mar 31, 2021 at 10:00 PM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> They are not needed after booting, so mark them as __init to move them
> to the __init section.
>
> Signed-off-by: Jisheng Zhang 
> ---
>  arch/riscv/kernel/traps.c  | 2 +-
>  arch/riscv/mm/init.c   | 6 +++---
>  arch/riscv/mm/kasan_init.c | 6 +++---
>  arch/riscv/mm/ptdump.c | 2 +-
>  4 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 1357abf79570..07fdded10c21 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -197,6 +197,6 @@ int is_valid_bugaddr(unsigned long pc)
>  #endif /* CONFIG_GENERIC_BUG */
>
>  /* stvec & scratch is already set from head.S */
> -void trap_init(void)
> +void __init trap_init(void)
>  {
>  }

trap_init() is currently unused, so you can drop this change
and instead remove trap_init() in a separate patch.

> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 067583ab1bd7..76bf2de8aa59 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -57,7 +57,7 @@ static void __init zone_sizes_init(void)
> free_area_init(max_zone_pfns);
>  }
>
> -static void setup_zero_page(void)
> +static void __init setup_zero_page(void)
>  {
> memset((void *)empty_zero_page, 0, PAGE_SIZE);
>  }
> @@ -75,7 +75,7 @@ static inline void print_mlm(char *name, unsigned long b, 
> unsigned long t)
>   (((t) - (b)) >> 20));
>  }
>
> -static void print_vm_layout(void)
> +static void __init print_vm_layout(void)
>  {
> pr_notice("Virtual kernel memory layout:\n");
> print_mlk("fixmap", (unsigned long)FIXADDR_START,
> @@ -557,7 +557,7 @@ static inline void setup_vm_final(void)
>  #endif /* CONFIG_MMU */
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -void protect_kernel_text_data(void)
> +void __init protect_kernel_text_data(void)
>  {
> unsigned long text_start = (unsigned long)_start;
> unsigned long init_text_start = (unsigned long)__init_text_begin;
> diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
> index 4f85c6d0ddf8..e1d041ac1534 100644
> --- a/arch/riscv/mm/kasan_init.c
> +++ b/arch/riscv/mm/kasan_init.c
> @@ -60,7 +60,7 @@ asmlinkage void __init kasan_early_init(void)
> local_flush_tlb_all();
>  }
>
> -static void kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned 
> long end)
> +static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, 
> unsigned long end)
>  {
> phys_addr_t phys_addr;
> pte_t *ptep, *base_pte;
> @@ -82,7 +82,7 @@ static void kasan_populate_pte(pmd_t *pmd, unsigned long 
> vaddr, unsigned long en
> set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
>  }
>
> -static void kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned 
> long end)
> +static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, 
> unsigned long end)
>  {
> phys_addr_t phys_addr;
> pmd_t *pmdp, *base_pmd;
> @@ -117,7 +117,7 @@ static void kasan_populate_pmd(pgd_t *pgd, unsigned long 
> vaddr, unsigned long en
> set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
>  }
>
> -static void kasan_populate_pgd(unsigned long vaddr, unsigned long end)
> +static void __init kasan_populate_pgd(unsigned long vaddr, unsigned long end)
>  {
> phys_addr_t phys_addr;
> pgd_t *pgdp = pgd_offset_k(vaddr);
> diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
> index ace74dec7492..3b7b6e4d025e 100644
> --- a/arch/riscv/mm/ptdump.c
> +++ b/arch/riscv/mm/ptdump.c
> @@ -331,7 +331,7 @@ static int ptdump_show(struct seq_file *m, void *v)
>
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
>
> -static int ptdump_init(void)
> +static int __init ptdump_init(void)
>  {
> unsigned int i, j;
>
> --
> 2.31.0
>
>
>

Apart from above, looks good to me.

Reviewed-by: Anup Patel 

Regards,
Anup


[PATCH v1 2/2] driver core: Improve fw_devlink & deferred_probe_timeout interaction

2021-04-01 Thread Saravana Kannan
The deferred_probe_timeout kernel command-line parameter allows probing of
consumer devices if the supplier devices don't have any drivers.

fw_devlink=on will indefinitely block probe() calls on a device if all
its suppliers haven't probed successfully. This completely skips calls
to driver_deferred_probe_check_state() since that's only called when a
.probe() function calls framework APIs. So fw_devlink=on breaks
deferred_probe_timeout.

deferred_probe_timeout in its current state also ignores a lot of
information that's now available to the kernel. It assumes all suppliers
that haven't probed when the timer expires (or when initcalls are done
on a static kernel) will never probe and fails any calls to acquire
resources from these unprobed suppliers.

However, this assumption by deferred_probe_timeout isn't true under many
conditions. For example:
- If the consumer happens to be before the supplier in the deferred
  probe list.
- If the supplier itself is waiting on its supplier to probe.

This patch fixes both these issues by relaxing device links between
devices only if the supplier doesn't have any driver that could match
with (NOT bound to) the supplier device. This way, we only fail attempts
to acquire resources from suppliers that truly don't have any driver vs
suppliers that just happen to not have probed yet.

Signed-off-by: Saravana Kannan 
---
 drivers/base/base.h |  1 +
 drivers/base/core.c | 64 -
 drivers/base/dd.c   |  5 
 3 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 1b44ed588f66..e5f9b7e656c3 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -191,6 +191,7 @@ extern void device_links_driver_cleanup(struct device *dev);
 extern void device_links_no_driver(struct device *dev);
 extern bool device_links_busy(struct device *dev);
 extern void device_links_unbind_consumers(struct device *dev);
+extern void fw_devlink_drivers_done(void);
 
 /* device pm support */
 void device_pm_move_to_tail(struct device *dev);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index de518178ac36..c05dae75b696 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -51,6 +51,7 @@ static LIST_HEAD(deferred_sync);
 static unsigned int defer_sync_state_count = 1;
 static DEFINE_MUTEX(fwnode_link_lock);
 static bool fw_devlink_is_permissive(void);
+static bool fw_devlink_drv_reg_done;
 
 /**
  * fwnode_link_add - Create a link between two fwnode_handles.
@@ -1598,6 +1599,52 @@ static void fw_devlink_parse_fwtree(struct fwnode_handle 
*fwnode)
fw_devlink_parse_fwtree(child);
 }
 
+static void fw_devlink_relax_link(struct device_link *link)
+{
+   if (!(link->flags & DL_FLAG_INFERRED))
+   return;
+
+   if (link->flags == (DL_FLAG_MANAGED | FW_DEVLINK_FLAGS_PERMISSIVE))
+   return;
+
+   pm_runtime_drop_link(link);
+   link->flags = DL_FLAG_MANAGED | FW_DEVLINK_FLAGS_PERMISSIVE;
+   dev_dbg(link->consumer, "Relaxing link with %s\n",
+   dev_name(link->supplier));
+}
+
+static int fw_devlink_no_driver(struct device *dev, void *data)
+{
+   struct device_link *link = to_devlink(dev);
+
+   if (!link->supplier->can_match)
+   fw_devlink_relax_link(link);
+
+   return 0;
+}
+
+void fw_devlink_drivers_done(void)
+{
+   fw_devlink_drv_reg_done = true;
+   device_links_write_lock();
+   class_for_each_device(&devlink_class, NULL, NULL,
+ fw_devlink_no_driver);
+   device_links_write_unlock();
+}
+
+static void fw_devlink_unblock_consumers(struct device *dev)
+{
+   struct device_link *link;
+
+   if (!fw_devlink_flags || fw_devlink_is_permissive())
+   return;
+
+   device_links_write_lock();
+   list_for_each_entry(link, &dev->links.consumers, s_node)
+   fw_devlink_relax_link(link);
+   device_links_write_unlock();
+}
+
 /**
  * fw_devlink_relax_cycle - Convert cyclic links to SYNC_STATE_ONLY links
  * @con: Device to check dependencies for.
@@ -1634,13 +1681,7 @@ static int fw_devlink_relax_cycle(struct device *con, 
void *sup)
 
ret = 1;
 
-   if (!(link->flags & DL_FLAG_INFERRED))
-   continue;
-
-   pm_runtime_drop_link(link);
-   link->flags = DL_FLAG_MANAGED | FW_DEVLINK_FLAGS_PERMISSIVE;
-   dev_dbg(link->consumer, "Relaxing link with %s\n",
-   dev_name(link->supplier));
+   fw_devlink_relax_link(link);
}
return ret;
 }
@@ -3275,6 +3316,15 @@ int device_add(struct device *dev)
}
 
bus_probe_device(dev);
+
+   /*
+* If all driver registration is done and a newly added device doesn't
+* match with any driver, don't block its consumers from probing in
+* case the consumer device is able to operate without this supplier.
+*/
+   if 

[PATCH v1 0/2] Fix deferred_probe_timeout and fw_devlink=on

2021-04-01 Thread Saravana Kannan
This series fixes existing bugs in deferred_probe_timeout and fixes some
interaction with fw_devlink=on.

Saravana Kannan (2):
  driver core: Fix locking bug in deferred_probe_timeout_work_func()
  driver core: Improve fw_devlink & deferred_probe_timeout interaction

 drivers/base/base.h |  1 +
 drivers/base/core.c | 64 -
 drivers/base/dd.c   | 13 ++---
 3 files changed, 68 insertions(+), 10 deletions(-)

-- 
2.31.0.208.g409f899ff0-goog



[PATCH v1 1/2] driver core: Fix locking bug in deferred_probe_timeout_work_func()

2021-04-01 Thread Saravana Kannan
list_for_each_entry_safe() is only useful if we are deleting nodes in a
linked list within the loop. It doesn't protect against other threads
adding/deleting nodes to the list in parallel. We need to grab
deferred_probe_mutex when traversing the deferred_probe_pending_list.

Cc: sta...@vger.kernel.org
Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init")
Signed-off-by: Saravana Kannan 
---
 drivers/base/dd.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 20b69b5e0e91..28ad8afd87bc 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -291,14 +291,16 @@ int driver_deferred_probe_check_state(struct device *dev)
 
 static void deferred_probe_timeout_work_func(struct work_struct *work)
 {
-   struct device_private *private, *p;
+   struct device_private *p;
 
driver_deferred_probe_timeout = 0;
driver_deferred_probe_trigger();
flush_work(&deferred_probe_work);
 
-   list_for_each_entry_safe(private, p, &deferred_probe_pending_list, deferred_probe)
-   dev_info(private->device, "deferred probe pending\n");
+   mutex_lock(&deferred_probe_mutex);
+   list_for_each_entry(p, &deferred_probe_pending_list, deferred_probe)
+   dev_info(p->device, "deferred probe pending\n");
+   mutex_unlock(&deferred_probe_mutex);
	wake_up_all(&probe_timeout_waitqueue);
 }
 static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, 
deferred_probe_timeout_work_func);
-- 
2.31.0.208.g409f899ff0-goog



Re: [External] Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-01 Thread Muchun Song
On Fri, Apr 2, 2021 at 6:55 AM Yang Shi  wrote:
>
> On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner  wrote:
> >
> > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Since the following patchsets applied. All the kernel memory are 
> > > > > > > charged
> > > > > > > with the new APIs of obj_cgroup.
> > > > > > >
> > > > > > > [v17,00/19] The new cgroup slab memory controller
> > > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > > > >
> > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long 
> > > > > > > time -
> > > > > > > it exists at a larger scale and is causing recurring problems in 
> > > > > > > the real
> > > > > > > world: page cache doesn't get reclaimed for a long time, or is 
> > > > > > > used by the
> > > > > > > second, third, fourth, ... instance of the same job that was 
> > > > > > > restarted into
> > > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, 
> > > > > > > waste memory,
> > > > > > > and make page reclaim very inefficient.
> > > > > > >
> > > > > > > We can convert LRU pages and most other raw memcg pins to the 
> > > > > > > objcg direction
> > > > > > > to fix this problem, and then the LRU pages will not pin the 
> > > > > > > memcgs.
> > > > > > >
> > > > > > > This patchset aims to make the LRU pages to drop the reference to 
> > > > > > > memory
> > > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that 
> > > > > > > the number
> > > > > > > of the dying cgroups will not increase if we run the following 
> > > > > > > test script.
> > > > > > >
> > > > > > > ```bash
> > > > > > > #!/bin/bash
> > > > > > >
> > > > > > > cat /proc/cgroups | grep memory
> > > > > > >
> > > > > > > cd /sys/fs/cgroup/memory
> > > > > > >
> > > > > > > for i in range{1..500}
> > > > > > > do
> > > > > > > mkdir test
> > > > > > > echo $$ > test/cgroup.procs
> > > > > > > sleep 60 &
> > > > > > > echo $$ > cgroup.procs
> > > > > > > echo `cat test/cgroup.procs` > cgroup.procs
> > > > > > > rmdir test
> > > > > > > done
> > > > > > >
> > > > > > > cat /proc/cgroups | grep memory
> > > > > > > ```
> > > > > > >
> > > > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > > > Patch 2-5 are code cleanup and simplification.
> > > > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > > >
> > > > > > The main concern I have with *just* reparenting LRU pages is that 
> > > > > > for
> > > > > > the long running systems, the root memcg will become a dumping 
> > > > > > ground.
> > > > > > In addition a job running multiple times on a machine will see
> > > > > > inconsistent memory usage if it re-accesses the file pages which 
> > > > > > were
> > > > > > reparented to the root memcg.
> > > > >
> > > > > I agree, but also the reparenting is not the perfect thing in a 
> > > > > combination
> > > > > with any memory protections (e.g. memory.low).
> > > > >
> > > > > Imagine the following configuration:
> > > > > workload.slice
> > > > > - workload_gen_1.service   memory.min = 30G
> > > > > - workload_gen_2.service   memory.min = 30G
> > > > > - workload_gen_3.service   memory.min = 30G
> > > > >   ...
> > > > >
> > > > > Parent cgroup and several generations of the child cgroup, protected 
> > > > > by a memory.low.
> > > > > Once the memory is getting reparented, it's not protected anymore.
> > > >
> > > > That doesn't sound right.
> > > >
> > > > A deleted cgroup today exerts no control over its abandoned
> > > > pages. css_reset() will blow out any control settings.
> > >
> > > I know. Currently it works in the following way: once cgroup gen_1 is 
> > > deleted,
> > > it's memory is not protected anymore, so eventually it's getting evicted 
> > > and
> > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change 
> > > this,
> > > of course. But long-term we likely wanna re-charge such pages to new 
> > > cgroups
> > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups 
> > > doesn't
> > > help and likely will complicate this change. So I'm a bit skeptical here.
> >
> > We should be careful with the long-term plans.
>
> Excuse me for a dumb question. I recall we did reparent LRU pages
> before (before 4.x kernel). I vaguely recall there were some tricky
> race conditions during reparenting so we didn't do it anymore once
> reclaimer could reclaim from offlined memcgs. My memory may be wrong,
> if it is not so please feel free to correct me. If my memory is true,

I looked through the history with git; your memory is right.

> it means the race 

Re: [PATCH v2] Documentation/translations/zh_CN/dev-tools/

2021-04-01 Thread Wu X.C.
Hi Bernard,

On Thu, Apr 01, 2021 at 06:27:16AM -0700, Bernard Zhao wrote:

Why is the charset in your email header 'y'?
"Content-Type: text/plain; charset=y"


> Add translations to dev-tools gcov
> 
> Signed-off-by: Bernard Zhao 
> Reviewed-by: Wu X.C 
  ^
  This Reviewed-by tag is invalid.

Please do not add a Reviewed-by tag before someone has given it.

> ---
> Changes since V1:
> * add index.rst in dev-tools and link it to zh_CN/index.rst
> * fix some inaccurate translation
> 
> Link for V1:
> *https://lore.kernel.org/patchwork/patch/1405740/
> ---
>  .../translations/zh_CN/dev-tools/gcov.rst | 279 ++
>  .../translations/zh_CN/dev-tools/index.rst|  39 +++
>  Documentation/translations/zh_CN/index.rst|   1 +
>  3 files changed, 319 insertions(+)
>  create mode 100644 Documentation/translations/zh_CN/dev-tools/gcov.rst
>  create mode 100644 Documentation/translations/zh_CN/dev-tools/index.rst
> 
> diff --git a/Documentation/translations/zh_CN/dev-tools/gcov.rst 
> b/Documentation/translations/zh_CN/dev-tools/gcov.rst
> new file mode 100644
> index ..e8ffb99b566d

Why did you replace all ',' and '。' with ',' and '.' in zh_CN/dev-tools/gcov.rst?
Also, the columns in v2 are much shorter than in v1.
Please revert both of these changes.

> --- /dev/null
> +++ b/Documentation/translations/zh_CN/dev-tools/gcov.rst
> @@ -0,0 +1,279 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. include:: ../disclaimer-zh_CN.rst
> +
> +:Original: :ref:`Documentation/dev-tools/gcov.rst `

The original text has no anchor label, so simply use:
:Original: Documentation/dev-tools/gcov.rst

> +:Translator: 赵军奎 Bernard Zhao 
> +
> +.. _dev-tools_gcov:

Please remove the above line; it is not needed.

> +
> +在Linux内核里使用gcov做代码覆盖率检查
> +

There are still a lot of warnings.
Please use a monospaced font so the title underline lengths can be matched.
Build log:

/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:11: WARNING: 
Title underline too short.

在Linux内核里使用gcov做代码覆盖率检查

/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:110: WARNING: 
Title underline too short.

针对模块的统计
---
/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:110: WARNING: 
Title underline too short.

针对模块的统计
---
/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:154: WARNING: 
Block quote ends without a blank line; unexpected unindent.
/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:179: WARNING: 
Title underline too short.

关于编译器的注意事项
-
/test/linux/Documentation/translations/zh_CN/dev-tools/gcov.rst:179: WARNING: 
Title underline too short.

关于编译器的注意事项
-


> +
> +gcov是linux中已经集成的一个分析模块,该模块在内核中对
> +GCC的代码覆盖率统计提供了支持.
> +linux内核运行时的代码覆盖率数据会以gcov兼容的格式存储
> +在debug-fs中,可以通过gcov的“-o”选项(如下示例)获得
> +指定文件的代码运行覆盖率统计数据(需要跳转到内核编
> +译路径下并且要有root权限)::
> +
> +# cd /tmp/linux-out
> +# gcov -o /sys/kernel/debug/gcov/tmp/linux-out/kernel spinlock.c
> +
> +这将在当前目录中创建带有执行计数注释的源代码文件.
> +在获得这些统计文件后,可以使用图形化的gcov[1]前端工
> +具(比如lcov[2]),来实现自动化处理linux 内核的覆
> +盖率运行数据,同时生成易于阅读的HTML格式文件.

Sorry for the inconvenience.
I tested again and found that the URL tags do not seem to cause namespace conflicts.
Thus:

在获得这些统计文件后,可以使用图形化的 gcov_ 前端工
具(比如 lcov_ ),来实现自动化处理linux 内核的覆

> +
> +可能的用途:
> +
> +* 调试(用来判断每一行的代码是否已经运行过)
> +* 测试改进(如何修改测试代码,尽可能地覆盖到没有运
> +  行过的代码)
> +* 内核配置优化(对于某一个选项配置,如果关联的代码
> +  从来没有运行过,是否还需要这个配置)
> +
> +[1]_gcov: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html
> +[2]_lcov: http://ltp.sourceforge.net/coverage/lcov.php

.. _gcov: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html
.. _lcov: http://ltp.sourceforge.net/coverage/lcov.php

> +
> +
> +准备
> +---
> +
> +内核打开如下配置::
> +
> +CONFIG_DEBUG_FS=y
> +CONFIG_GCOV_KERNEL=y
> +
> +获取整个内核的覆盖率数据,还需要打开::
> +
> +CONFIG_GCOV_PROFILE_ALL=y
> +
> +需要注意的是,整个内核开启覆盖率统计会造成内核镜像
> +文件尺寸的增大,同时内核运行的也会变慢一些.
> +另外,并不是所有的架构都支持整个内核开启覆盖率统计
> +
> +代码运行覆盖率数据只在debugfs挂载完成后才可以访问::
> +
> +mount -t debugfs none /sys/kernel/debug
> +
> +
> +客制化
> +-
> +
> +如果要单独针对某一个路径或者文件进行代码覆盖率统计
> +可以在内核相应路径的Makefile中增加如下的配置:
> +
> +- 单独统计单个文件(例如main.o)::
> +
> +GCOV_PROFILE_main.o := y
> +
> +- 单独统计某一个路径::
> +
> +GCOV_PROFILE := y
> +
> +如果要在整个内核的覆盖率统计(CONFIG_GCOV_PROFILE_ALL)
> +中单独排除某一个文件或者路径,可以使用如下的方法::
> +
> +GCOV_PROFILE_main.o := n
> +
> +和::
> +
> +GCOV_PROFILE := n
> +
> +此机制仅支持链接到内核镜像或编译为内核模块的文件.
> +
> +
> +相关文件
> +-
> +
> +gcov功能需要在debugfs中创建如下文件:
> +
> +``/sys/kernel/debug/gcov``
> +gcov相关功能的根路径
> +
> +``/sys/kernel/debug/gcov/reset``
> +全局复位文件:向该文件写入数据后会将所有的gcov统计
> +数据清0
> +
> +``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcda``
> +gcov工具可以识别的覆盖率统计数据文件,向该文件写入
> +数据后会将本文件的gcov统计数据清0
> +
> +``/sys/kernel/debug/gcov/path/to/compile/dir/file.gcno``
> +gcov工具需要的软连接文件(指向编译时生成的信息统
> +计文件),这个文件是在gcc编译时如果配置了选项
> 

[PATCH v2] ACPICA: Events: support fixed pcie wake event

2021-04-01 Thread Jianmin Lv
Some chipsets support fixed pcie wake event which is
defined in the PM1 block(related description can be found
in 4.8.3.1.1 PM1 Status Registers, 4.8.3.2.1 PM1 Control
Registers and 5.2.9 Fixed ACPI Description Table (FADT)),
such as LS7A1000 of Loongson company, so we add code to
handle it.

ACPI spec link:
https://uefi.org/sites/default/files/resources/ACPI_6_3_May16.pdf

Signed-off-by: Jianmin Lv 
---
 drivers/acpi/acpica/evevent.c  |  8 ++--
 drivers/acpi/acpica/hwsleep.c  | 12 
 drivers/acpi/acpica/utglobal.c |  4 
 include/acpi/actypes.h |  3 ++-
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/acpica/evevent.c b/drivers/acpi/acpica/evevent.c
index 35385148fedb..08ba368beb2d 100644
--- a/drivers/acpi/acpica/evevent.c
+++ b/drivers/acpi/acpica/evevent.c
@@ -185,6 +185,10 @@ u32 acpi_ev_fixed_event_detect(void)
return (int_status);
}
 
+   if (fixed_enable & ACPI_BITMASK_PCIEXP_WAKE_DISABLE)
+   fixed_enable &= ~ACPI_BITMASK_PCIEXP_WAKE_DISABLE;
+   else
+   fixed_enable |= ACPI_BITMASK_PCIEXP_WAKE_DISABLE;
ACPI_DEBUG_PRINT((ACPI_DB_INTERRUPTS,
  "Fixed Event Block: Enable %08X Status %08X\n",
  fixed_enable, fixed_status));
@@ -250,8 +254,8 @@ static u32 acpi_ev_fixed_event_dispatch(u32 event)
if (!acpi_gbl_fixed_event_handlers[event].handler) {
(void)acpi_write_bit_register(acpi_gbl_fixed_event_info[event].
  enable_register_id,
- ACPI_DISABLE_EVENT);
-
+   event == ACPI_EVENT_PCIE_WAKE ?
+   ACPI_ENABLE_EVENT : 
ACPI_DISABLE_EVENT);
ACPI_ERROR((AE_INFO,
"No installed handler for fixed event - %s (%u), 
disabling",
acpi_ut_get_event_name(event), event));
diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 14baa13bf848..7e7ea4c2e914 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -312,6 +312,18 @@ acpi_status acpi_hw_legacy_wake(u8 sleep_state)
[ACPI_EVENT_SLEEP_BUTTON].
status_register_id, ACPI_CLEAR_STATUS);
 
+   /* Enable pcie wake event if support */
+   if ((acpi_gbl_FADT.flags & ACPI_FADT_PCI_EXPRESS_WAKE)) {
+   (void)
+   acpi_write_bit_register(acpi_gbl_fixed_event_info
+   [ACPI_EVENT_PCIE_WAKE].
+   enable_register_id, ACPI_DISABLE_EVENT);
+   (void)
+   acpi_write_bit_register(acpi_gbl_fixed_event_info
+   [ACPI_EVENT_PCIE_WAKE].
+   status_register_id, ACPI_CLEAR_STATUS);
+   }
+
acpi_hw_execute_sleep_method(METHOD_PATHNAME__SST, ACPI_SST_WORKING);
return_ACPI_STATUS(status);
 }
diff --git a/drivers/acpi/acpica/utglobal.c b/drivers/acpi/acpica/utglobal.c
index 59a48371a7bc..68baf16d8a02 100644
--- a/drivers/acpi/acpica/utglobal.c
+++ b/drivers/acpi/acpica/utglobal.c
@@ -186,6 +186,10 @@ struct acpi_fixed_event_info 
acpi_gbl_fixed_event_info[ACPI_NUM_FIXED_EVENTS] =
ACPI_BITREG_RT_CLOCK_ENABLE,
ACPI_BITMASK_RT_CLOCK_STATUS,
ACPI_BITMASK_RT_CLOCK_ENABLE},
+   /* ACPI_EVENT_PCIE_WAKE */ {ACPI_BITREG_PCIEXP_WAKE_STATUS,
+   ACPI_BITREG_PCIEXP_WAKE_DISABLE,
+   ACPI_BITMASK_PCIEXP_WAKE_STATUS,
+   ACPI_BITMASK_PCIEXP_WAKE_DISABLE},
 };
 #endif /* !ACPI_REDUCED_HARDWARE */
 
diff --git a/include/acpi/actypes.h b/include/acpi/actypes.h
index 92c71dfce0d5..0b6c72033487 100644
--- a/include/acpi/actypes.h
+++ b/include/acpi/actypes.h
@@ -714,7 +714,8 @@ typedef u32 acpi_event_type;
 #define ACPI_EVENT_POWER_BUTTON 2
 #define ACPI_EVENT_SLEEP_BUTTON 3
 #define ACPI_EVENT_RTC  4
-#define ACPI_EVENT_MAX  4
+#define ACPI_EVENT_PCIE_WAKE5
+#define ACPI_EVENT_MAX  5
 #define ACPI_NUM_FIXED_EVENTS   ACPI_EVENT_MAX + 1
 
 /*
-- 
2.27.0



Re: [PATCH v2 1/3] hpsa: use __packed on individual structs, not header-wide

2021-04-01 Thread Martin K. Petersen
On Tue, 30 Mar 2021 08:19:56 +0100, Sergei Trofimovich wrote:

> Some of the structs contain `atomic_t` values and are not intended to be
> sent to the I/O controller as-is.
> 
> The change adds __packed to every struct and union in the file.
> Follow-up commits will fix `atomic_t` problems.
> 
> The commit is a no-op at least on ia64:
> $ diff -u <(objdump -d -r old.o) <(objdump -d -r new.o)

Applied to 5.12/scsi-fixes, thanks!

[1/3] hpsa: use __packed on individual structs, not header-wide
  https://git.kernel.org/mkp/scsi/c/5482a9a1a8fd
[2/3] hpsa: fix boot on ia64 (atomic_t alignment)
  https://git.kernel.org/mkp/scsi/c/02ec144292bc
[3/3] hpsa: add an assert to prevent from __packed reintroduction
  https://git.kernel.org/mkp/scsi/c/e01a00ff62ad

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 6/6] Fix spelling typo of is

2021-04-01 Thread Martin K. Petersen
On Fri, 26 Mar 2021 11:04:12 +0800, qiumibaoz...@163.com wrote:

> 


Applied to 5.13/scsi-queue, thanks!

[6/6] Fix spelling typo of is
  https://git.kernel.org/mkp/scsi/c/ce0b6e388772

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] __scsi_remove_device: fix comments minor error

2021-04-01 Thread Martin K. Petersen
On Fri, 26 Mar 2021 14:09:02 +0800, dudengke wrote:

> 

Applied to 5.13/scsi-queue, thanks!

[1/1] __scsi_remove_device: fix comments minor error
  https://git.kernel.org/mkp/scsi/c/eee8910fe0b5

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH -next 2/2] scsi: myrs: Make symbols DAC960_{GEM/BA/LP}_privdata static

2021-04-01 Thread Martin K. Petersen
On Sat, 27 Mar 2021 15:31:57 +0800, Shixin Liu wrote:

> These symbols are not used outside of myrs.c, so we can mark them static.

Applied to 5.13/scsi-queue, thanks!

[2/2] scsi: myrs: Make symbols DAC960_{GEM/BA/LP}_privdata static
  https://git.kernel.org/mkp/scsi/c/e27f3c88e250

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi: scsi_priv: Remove duplicate declaration

2021-04-01 Thread Martin K. Petersen
On Sat, 27 Mar 2021 11:08:50 +0800, Wan Jiabing wrote:

> struct request and struct request_queue have been
> declared at forward struct declaration.
> Remove the duplicate and reorder the forward declaration
> to be in alphabetic order.

Applied to 5.13/scsi-queue, thanks!

[1/1] scsi: scsi_priv: Remove duplicate declaration
  https://git.kernel.org/mkp/scsi/c/fe515ac82768

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] include: scsi: scsi_host_cmd_pool is declared twice

2021-04-01 Thread Martin K. Petersen
On Thu, 25 Mar 2021 14:46:31 +0800, Wan Jiabing wrote:

> struct scsi_host_cmd_pool has been declared. Remove the duplicate.

Applied to 5.13/scsi-queue, thanks!

[1/1] include: scsi: scsi_host_cmd_pool is declared twice
  https://git.kernel.org/mkp/scsi/c/6bfe9855daa3

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH -next 1/2] scsi: myrb: Make symbols DAC960_{LA/PG/PD/P}_privdata static

2021-04-01 Thread Martin K. Petersen
On Sat, 27 Mar 2021 15:31:56 +0800, Shixin Liu wrote:

> These symbols are not used outside of myrb.c, so we can mark them static.

Applied to 5.13/scsi-queue, thanks!

[1/2] scsi: myrb: Make symbols DAC960_{LA/PG/PD/P}_privdata static
  https://git.kernel.org/mkp/scsi/c/182ad87c95e7

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] scsi: qedi: Remove redundant assignment to variable err

2021-04-01 Thread Martin K. Petersen
On Sat, 27 Mar 2021 23:06:50 +, Colin King wrote:

> The variable err is assigned -ENOMEM, followed by an error return path
> via the label err_udev that does not access the variable and returns
> with the -ENOMEM error code. The assignment to err is therefore
> redundant and can be removed.

Applied to 5.13/scsi-queue, thanks!

[1/1] scsi: qedi: Remove redundant assignment to variable err
  https://git.kernel.org/mkp/scsi/c/8dc602529681

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] Fix fnic driver to remove bogus ratelimit messages.

2021-04-01 Thread Martin K. Petersen
On Tue, 23 Mar 2021 10:27:56 -0700, ldun...@suse.com wrote:

> Commit b43abcbbd5b1 ("scsi: fnic: Ratelimit printks to avoid
> looding when vlan is not set by the switch.i") added
> printk_ratelimit() in front of a couple of debug-mode
> messages, to reduce logging overrun when debugging the
> driver. The code:
> 
> >   if (printk_ratelimit())
> >   FNIC_FCS_DBG(KERN_DEBUG, fnic->lport->host,
> > "Start VLAN Discovery\n");
> 
> [...]

Applied to 5.13/scsi-queue, thanks!

[1/1] Fix fnic driver to remove bogus ratelimit messages.
  https://git.kernel.org/mkp/scsi/c/d2478dd25691

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH][next] scsi: a100u2w: remove unused variable biosaddr

2021-04-01 Thread Martin K. Petersen
On Thu, 25 Mar 2021 17:07:31 +, Colin King wrote:

> The variable biosaddr is being assigned a value that is never read;
> the variable is redundant and can be safely removed.

Applied to 5.13/scsi-queue, thanks!

[1/1] scsi: a100u2w: remove unused variable biosaddr
  https://git.kernel.org/mkp/scsi/c/92b4c52c43e1

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Baolu,

On 2021/4/2 11:06, Lu Baolu wrote:
> Hi Longpeng,
> 
> On 4/1/21 3:18 PM, Longpeng(Mike) wrote:
>> The translation caches may preserve obsolete data when the
>> mapping size is changed. Consider the following sequence, which
>> can reveal the problem with high probability.
>>
>> 1.mmap(4GB,MAP_HUGETLB)
>> 2.
>>    while (1) {
>>     (a)    DMA MAP   0,0xa
>>     (b)    DMA UNMAP 0,0xa
>>     (c)    DMA MAP   0,0xc000
>>   * DMA read of IOVA 0 may fail here (Not present)
>>   * if the problem occurs.
>>     (d)    DMA UNMAP 0,0xc000
>>    }
>>
>> The page table(only focus on IOVA 0) after (a) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x1a30a72003  entry:0x89b39cacb000
>>  PTE: 0x21d200803  entry:0x89b3b0a72000
>>
>> The page table after (b) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x1a30a72003  entry:0x89b39cacb000
>>  PTE: 0x0  entry:0x89b3b0a72000
>>
>> The page table after (c) is:
>>   PML4: 0x19db5c1003   entry:0x899bdcd2f000
>>    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
>>     PDE: 0x21d200883   entry:0x89b39cacb000 (*)
>>
>> Because the PDE entry after (b) is present, it won't be
>> flushed even though the iommu driver flushes the cache on unmap,
>> so the obsolete data may be preserved in the cache, which
>> causes a wrong translation in the end.
>>
>> However, we can see the PDE entry finally switches to a
>> 2M-superpage mapping, but it does not transform
>> to 0x21d200883 directly:
>>
>> 1. PDE: 0x1a30a72003
>> 2. __domain_mapping
>>   dma_pte_free_pagetable
>>     Set the PDE entry to ZERO
>>   Set the PDE entry to 0x21d200883
>>
>> So we must flush the cache after the entry switches to ZERO
>> to avoid preserving the obsolete info.
>>
>> Cc: David Woodhouse 
>> Cc: Lu Baolu 
>> Cc: Nadav Amit 
>> Cc: Alex Williamson 
>> Cc: Kevin Tian 
>> Cc: Gonglei (Arei) 
>>
>> Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage")
>> Cc:  # v3.0+
>> Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
>>
>> Suggested-by: Lu Baolu 
>> Signed-off-by: Longpeng(Mike) 
>> ---
>>   drivers/iommu/intel/iommu.c | 15 +--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index ee09323..cbcb434 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct
>> dmar_domain *domain,
>>    * removed to make room for superpage(s).
>>    * We're adding new large pages, so make sure
>>    * we don't remove their parent tables.
>> + *
>> + * We also need to flush the iotlb before creating
>> +    * superpage to ensure it does not preserve any
>> + * obsolete info.
>>    */
>> -    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
>> -   largepage_lvl + 1);
>> +    if (dma_pte_present(pte)) {
>> +    int i;
>> +
>> +    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
>> +   largepage_lvl + 1);
>> +    for_each_domain_iommu(i, domain)
>> +    iommu_flush_iotlb_psi(g_iommus[i], domain,
>> +  iov_pfn, nr_pages, 0, 0);
> 
> Thanks for patch!
> 
> How about making the flushed page size accurate? For example,
> 
> @@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> 			dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
> 					       largepage_lvl + 1);
> 			for_each_domain_iommu(i, domain)
> -				iommu_flush_iotlb_psi(g_iommus[i], domain,
> -						      iov_pfn, nr_pages, 0, 0);
> +				iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
> +						      ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);
> 
Yes, make sense.

Maybe another alternative is 'end_pfn - iova_pfn + 1'; it's readable because we
free the page table with (iova_pfn, end_pfn) above. Which one do you prefer?

> 
>> +    }
>>   } else {
>>   pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
>>   }
>>
> 
> Best regards,
> baolu
> .


[PATCH] ASoC: max98390: Add controls for tx path

2021-04-01 Thread Steve Lee
Add controls for the TX path: enable, Hi-Z, and source selection.

Signed-off-by: Steve Lee 
---
 sound/soc/codecs/max98390.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/sound/soc/codecs/max98390.c b/sound/soc/codecs/max98390.c
index bb736c44e68a..163093959da8 100644
--- a/sound/soc/codecs/max98390.c
+++ b/sound/soc/codecs/max98390.c
@@ -656,6 +656,12 @@ static const struct snd_kcontrol_new 
max98390_snd_controls[] = {
MAX98390_AMP_DSP_CFG_RMP_DN_SHIFT, 1, 0),
SOC_SINGLE("Boost Clock Phase", MAX98390_BOOST_CTRL3,
MAX98390_BOOST_CLK_PHASE_CFG_SHIFT, 3, 0),
+   SOC_SINGLE("Tx Enable Selection", MAX98390_PCM_TX_EN_A,
+   0, 255, 0),
+   SOC_SINGLE("Tx Hiz Selection", MAX98390_PCM_TX_HIZ_CTRL_A,
+   0, 255, 0),
+   SOC_SINGLE("Tx Source Selection", MAX98390_PCM_CH_SRC_2,
+   0, 255, 0),
SOC_ENUM("Boost Output Voltage", max98390_boost_voltage),
SOC_ENUM("Current Limit", max98390_current_limit),
SOC_SINGLE_EXT("DSM Rdc", SND_SOC_NOPM, 0, 0xff, 0,
-- 
2.17.1



[RFC PATCH v2 1/1] arm64: Implement stack trace termination record

2021-04-01 Thread madvenka
From: "Madhavan T. Venkataraman" 

Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.

Kernel Tasks


All tasks except the idle task have a pt_regs structure right after the
task stack. This is called the task pt_regs. The pt_regs structure has a
special stackframe field. Make this stackframe field the final frame in the
task stack. This needs to be done in copy_thread() which initializes a new
task's pt_regs and initial CPU context.

For the idle task, there is no task pt_regs. For our purpose, we need one.
So, create a pt_regs just like other kernel tasks and make
pt_regs->stackframe the final frame in the idle task stack. This needs to be
done in two places:

- On the primary CPU, the boot task runs. It calls start_kernel()
  and eventually becomes the idle task for the primary CPU. Just
  before start_kernel() is called, set up the final frame.

- On each secondary CPU, a startup task runs that calls
  secondary_start_kernel() and eventually becomes the idle task
  on the secondary CPU. Just before secondary_start_kernel() is
  called, set up the final frame.

User Tasks
==

User tasks are initially set up like kernel tasks when they are created.
Then, they return to userland after fork via ret_from_fork(). After that,
they enter the kernel only on an EL0 exception. (In arm64, system calls are
also EL0 exceptions). The EL0 exception handler stores state in the task
pt_regs and calls different functions based on the type of exception. The
stack trace for an EL0 exception must end at the task pt_regs. So, make
task pt_regs->stackframe the final frame in the EL0 exception stack.

In summary, task pt_regs->stackframe is where a successful stack trace ends.

Stack trace termination
===

In the unwinder, terminate the stack trace successfully when
task_pt_regs(task)->stackframe is reached. For stack traces in the kernel,
this will correctly terminate the stack trace at the right place.

However, debuggers terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel.

Signed-off-by: Madhavan T. Venkataraman 
---
 arch/arm64/kernel/entry.S  |  8 +---
 arch/arm64/kernel/head.S   | 29 +++--
 arch/arm64/kernel/process.c|  5 +
 arch/arm64/kernel/stacktrace.c | 10 +-
 4 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index a31a0a713c85..e2dc2e998934 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -261,16 +261,18 @@ alternative_else_nop_endif
stp lr, x21, [sp, #S_LR]
 
/*
-* For exceptions from EL0, terminate the callchain here.
+* For exceptions from EL0, terminate the callchain here at
+* task_pt_regs(current)->stackframe.
+*
 * For exceptions from EL1, create a synthetic frame record so the
 * interrupted code shows up in the backtrace.
 */
.if \el == 0
-   mov x29, xzr
+   stp xzr, xzr, [sp, #S_STACKFRAME]
.else
stp x29, x22, [sp, #S_STACKFRAME]
-   add x29, sp, #S_STACKFRAME
.endif
+   add x29, sp, #S_STACKFRAME
 
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 alternative_if_not ARM64_HAS_PAN
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 840bda1869e9..743c019a42c7 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -393,6 +393,23 @@ SYM_FUNC_START_LOCAL(__create_page_tables)
ret x28
 SYM_FUNC_END(__create_page_tables)
 
+   /*
+* The boot task becomes the idle task for the primary CPU. The
+* CPU startup task on each secondary CPU becomes the idle task
+* for the secondary CPU.
+*
+* The idle task does not require pt_regs. But create a dummy
+* pt_regs so that task_pt_regs(idle_task)->stackframe can be
+* set up to be the final frame on the idle task stack just like
+* all the other kernel tasks. This helps the unwinder to
+* terminate the stack trace at a well-known stack offset.
+*/
+   .macro setup_final_frame
+   sub sp, sp, #PT_REGS_SIZE
+   stp xzr, xzr, [sp, #S_STACKFRAME]
+   add x29, sp, #S_STACKFRAME
+   .endm
+
 /*
  * The following fragment of code is executed with the MMU enabled.
  *
@@ -447,9 +464,9 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 #endif
bl  switch_to_vhe   // Prefer 

[RFC PATCH v2 0/1] arm64: Implement stack trace termination record

2021-04-01 Thread madvenka
From: "Madhavan T. Venkataraman" 

Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.

All tasks have a pt_regs structure right after the task stack in the stack
page. The pt_regs structure contains a stackframe field. Make this stackframe
field the final frame in the task stack so all stack traces end at a fixed
stack offset.

For kernel tasks, this is simple to understand. For user tasks, there is
some extra detail. User tasks get created via fork() et al. Once they return
from fork, they enter the kernel only on an EL0 exception. In arm64,
system calls are also EL0 exceptions.

The EL0 exception handler uses the task pt_regs mentioned above to save
register state and call different exception functions. All stack traces
from EL0 exception code must end at the pt_regs. So, make pt_regs->stackframe
the final frame in the EL0 exception stack.

To summarize, task_pt_regs(task)->stackframe will always be the final frame
in a stack trace.

Sample stack traces
===

The final frame for the idle tasks is different from v1. The rest of the
stack traces are the same.

Primary CPU's idle task (changed from v1)
===

[0.022365]   arch_stack_walk+0x0/0xd0
[0.022376]   callfd_stack+0x30/0x60
[0.022387]   rest_init+0xd8/0xf8
[0.022397]   arch_call_rest_init+0x18/0x24
[0.022411]   start_kernel+0x5b8/0x5f4
[0.022424]   __primary_switched+0xa8/0xac

Secondary CPU's idle task (changed from v1)
=

[0.022484]   arch_stack_walk+0x0/0xd0
[0.022494]   callfd_stack+0x30/0x60
[0.022502]   secondary_start_kernel+0x188/0x1e0
[0.022513]   __secondary_switched+0x80/0x84

---
Changelog:

v1
- Set up task_pt_regs(current)->stackframe as the final frame
  when a new task is initialized in copy_thread().

- Create pt_regs for the idle tasks and set up pt_regs->stackframe
  as the final frame for the idle tasks.

- Set up task_pt_regs(current)->stackframe as the final frame in
  the EL0 exception handler so the EL0 exception stack trace ends
  there.

- Terminate the stack trace successfully in unwind_frame() when
  the FP reaches task_pt_regs(current)->stackframe.

- The stack traces (above) in the kernel will terminate at the
  correct place. Debuggers may show an extra record 0x0 at the end
  for pt_regs->stackframe. That said, I did not see that extra frame
  when I did stack traces using gdb.
v2
- Changed some wordings as suggested by Mark Rutland.

- Removed the synthetic return PC for idle tasks. Changed the
  branches to start_kernel() and secondary_start_kernel() to
  calls so that they will have a proper return PC.

Madhavan T. Venkataraman (1):
  arm64: Implement stack trace termination record

 arch/arm64/kernel/entry.S  |  8 +---
 arch/arm64/kernel/head.S   | 29 +++--
 arch/arm64/kernel/process.c|  5 +
 arch/arm64/kernel/stacktrace.c | 10 +-
 4 files changed, 38 insertions(+), 14 deletions(-)


base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b
-- 
2.25.1



[PATCH] drm/amdgpu: Fix a potential sdma invalid access

2021-04-01 Thread Qu Huang
Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(); the VRAM
memory may then be evicted, with the memory region replaced by a GTT
region. amdgpu_bo_release_notify() will subsequently acquire the
bo->base.resv lock, and SDMA will get an invalid address in
amdgpu_fill_buffer(), resulting in a VM fault or memory corruption.

To avoid this, we have to take the bo->base.resv lock first, and
then check whether mem.mem_type is TTM_PL_VRAM.
Signed-off-by: Qu Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object 
*bo)
if (bo->base.resv == >base._resv)
amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-   !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+   if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
return;

dma_resv_lock(bo->base.resv, NULL);

+   if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+   dma_resv_unlock(bo->base.resv);
+   return;
+   }
+
r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, );
if (!WARN_ON(r)) {
amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1



Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

The translation caches may preserve obsolete data when the
mapping size is changed. Consider the following sequence, which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
   while (1) {
(a)DMA MAP   0,0xa
(b)DMA UNMAP 0,0xa
(c)DMA MAP   0,0xc000
  * DMA read of IOVA 0 may fail here (Not present)
  * if the problem occurs.
(d)DMA UNMAP 0,0xc000
   }

The page table(only focus on IOVA 0) after (a) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be
flushed even though the iommu driver flushes the cache on unmap,
so the obsolete data may be preserved in the cache, which
causes a wrong translation in the end.

However, we can see the PDE entry finally switches to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
  dma_pte_free_pagetable
Set the PDE entry to ZERO
  Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid preserving the obsolete info.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating superpage")
Cc:  # v3.0+
Link: https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
  drivers/iommu/intel/iommu.c | 15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct 
dmar_domain *domain,
 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+* We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {
+   int i;
+
+   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+  largepage_lvl + 1);
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i], domain,
+ iov_pfn, nr_pages, 0, 0);


Thanks for patch!

How about making the flushed page size accurate? For example,

@@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 			dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
 					       largepage_lvl + 1);
 			for_each_domain_iommu(i, domain)
-				iommu_flush_iotlb_psi(g_iommus[i], domain,
-						      iov_pfn, nr_pages, 0, 0);
+				iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
+						      ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);




+   }
} else {
pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
}



Best regards,
baolu


Re: [External] Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-01 Thread Muchun Song
On Fri, Apr 2, 2021 at 1:15 AM Shakeel Butt  wrote:
>
> On Thu, Apr 1, 2021 at 9:08 AM Muchun Song  wrote:
> >
> [...]
> > > The zombie issue is a pretty urgent concern that has caused several
> > > production emergencies now. It needs a fix sooner rather than later.
> >
> > Thank you very much for clarifying the problem for me. I do agree
> > with you. This issue should be fixed ASAP. Using objcg is a good
> > choice. Dying objcg should not be a problem. Because the size of
> > objcg is so small compared to memcg.
> >
>
> Just wanted to say out loud that yes this patchset will reduce the
> memcg zombie issue but this is not the final destination. We should
> continue the discussions on sharing/reusing scenarios.

Yeah. Reducing the number of zombie memcgs is not the final destination,
but it is an optimization. OK, the discussion about sharing/reusing
is also welcome.

>
> Muchun, can you please also CC Hugh Dickins and Alex Shi in the next
> version of your patchset?

No problem. I will cc Alex Shi in the next version.


[PATCH] apply: use DEFINE_SPINLOCK() instead of spin_lock_init().

2021-04-01 Thread Yu Jiahua
From: Jiahua Yu 

spinlock can be initialized automatically with DEFINE_SPINLOCK()
rather than explicitly calling spin_lock_init().

Signed-off-by: Jiahua Yu 
---
 drivers/video/fbdev/omap2/omapfb/dss/apply.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/video/fbdev/omap2/omapfb/dss/apply.c 
b/drivers/video/fbdev/omap2/omapfb/dss/apply.c
index c71021091828..acca991c7540 100644
--- a/drivers/video/fbdev/omap2/omapfb/dss/apply.c
+++ b/drivers/video/fbdev/omap2/omapfb/dss/apply.c
@@ -108,7 +108,7 @@ static struct {
 } dss_data;
 
 /* protects dss_data */
-static spinlock_t data_lock;
+static DEFINE_SPINLOCK(data_lock);
 /* lock for blocking functions */
 static DEFINE_MUTEX(apply_lock);
 static DECLARE_COMPLETION(extra_updated_completion);
@@ -131,8 +131,6 @@ static void apply_init_priv(void)
struct mgr_priv_data *mp;
int i;
 
-   spin_lock_init(_lock);
-
for (i = 0; i < num_ovls; ++i) {
struct ovl_priv_data *op;
 
-- 
2.17.1



Re: BUG_ON(!mapping_empty(>i_data))

2021-04-01 Thread Matthew Wilcox
On Thu, Apr 01, 2021 at 06:06:15PM +0100, Matthew Wilcox wrote:
> On Wed, Mar 31, 2021 at 02:58:12PM -0700, Hugh Dickins wrote:
> > I suspect there's a bug in the XArray handling in collapse_file(),
> > which sometimes leaves empty nodes behind.
> 
> Urp, yes, that can easily happen.
> 
> /* This will be less messy when we use multi-index entries */
> do {
> xas_lock_irq();
> xas_create_range();
> if (!xas_error())
> break;
> if (!xas_nomem(, GFP_KERNEL)) {
> result = SCAN_FAIL;
> goto out;
> }
> 
> xas_create_range() can absolutely create nodes with zero entries.
> So if we create m/n nodes and then it runs out of memory (or cgroup
> denies it), we can leave nodes in the tree with zero entries.
> 
> There are three options for fixing it ...
>  - Switch to using multi-index entries.  We need to do this anyway, but
>I don't yet have a handle on the bugs that you found last time I
>pushed this into linux-next.  At -rc5 seems like a late stage to be
>trying this solution.
>  - Add an xas_prune_range() that gets called on failure.  Should be
>straightforward to write, but will be obsolete as soon as we do the
>above and it's a pain for the callers.
>  - Change how xas_create_range() works to merely preallocate the xa_nodes
>and not insert them into the tree until we're trying to insert data into
>them.  I favour this option, and this scenario is amenable to writing
>a test that will simulate failure halfway through.
> 
> I'm going to start on option 3 now.

option 3 didn't work out terribly well.  So here's option 4; if we fail
to allocate memory when creating a node, prune the tree.  This fixes
(I think) the problem inherited from the radix tree, although the test
case is only for xas_create_range().  I should add a couple of test cases
for xas_create() failing, but I just got this to pass and I wanted to
send it out as soon as possible.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8b1c318189ce..84c6057932f3 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1463,6 +1463,30 @@ static noinline void check_create_range_4(struct xarray 
*xa,
XA_BUG_ON(xa, !xa_empty(xa));
 }
 
+static noinline void check_create_range_5(struct xarray *xa,
+   unsigned long index, unsigned order)
+{
+   XA_STATE_ORDER(xas, xa, index, order);
+   int i = 0;
+   gfp_t gfp = GFP_KERNEL;
+
+   XA_BUG_ON(xa, !xa_empty(xa));
+
+   do {
+   xas_lock();
+   xas_create_range();
+   xas_unlock();
+   if (++i == 4)
+   gfp = GFP_NOWAIT;
+   } while (xas_nomem(, gfp));
+
+   if (!xas_error())
+   xa_destroy(xa);
+
+   XA_BUG_ON(xa, xas.xa_alloc);
+   XA_BUG_ON(xa, !xa_empty(xa));
+}
+
 static noinline void check_create_range(struct xarray *xa)
 {
unsigned int order;
@@ -1490,6 +1514,12 @@ static noinline void check_create_range(struct xarray 
*xa)
check_create_range_4(xa, (3U << order) + 1, order);
check_create_range_4(xa, (3U << order) - 1, order);
check_create_range_4(xa, (1U << 24) + 1, order);
+
+   check_create_range_5(xa, 0, order);
+   check_create_range_5(xa, (1U << order), order);
+   check_create_range_5(xa, (2U << order), order);
+   check_create_range_5(xa, (3U << order), order);
+   check_create_range_5(xa, (1U << (2 * order)), order);
}
 
check_create_range_3();
diff --git a/lib/xarray.c b/lib/xarray.c
index f5d8f54907b4..923ccde6379e 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -276,77 +276,6 @@ static void xas_destroy(struct xa_state *xas)
}
 }
 
-/**
- * xas_nomem() - Allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * If we need to add new nodes to the XArray, we try to allocate memory
- * with GFP_NOWAIT while holding the lock, which will usually succeed.
- * If it fails, @xas is flagged as needing memory to continue.  The caller
- * should drop the lock and call xas_nomem().  If xas_nomem() succeeds,
- * the caller should retry the operation.
- *
- * Forward progress is guaranteed as one node is allocated here and
- * stored in the xa_state where it will be found by xas_alloc().  More
- * nodes will likely be found in the slab allocator, but we do not tie
- * them up here.
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-bool xas_nomem(struct xa_state *xas, gfp_t gfp)
-{
-   if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-   xas_destroy(xas);
-   return false;
-   }
-   if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
-   gfp |= __GFP_ACCOUNT;
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-   

[PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Josh Hunt
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
Acked-by: Johannes Weiner 
---
 kernel/sched/psi.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9bd7ed..d1212f17a898 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v)
return psi_show(m, &psi_system, PSI_CPU);
 }
 
+static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *))
+{
+   if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
+   return single_open(file, psi_show, NULL);
+}
+
 static int psi_io_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_io_show, NULL);
+   return psi_open(file, psi_io_show);
 }
 
 static int psi_memory_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_memory_show, NULL);
+   return psi_open(file, psi_memory_show);
 }
 
 static int psi_cpu_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_cpu_show, NULL);
+   return psi_open(file, psi_cpu_show);
 }
 
 struct psi_trigger *psi_trigger_create(struct psi_group *group,
@@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
-- 
2.17.1
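As a rough user-space model of the gate this patch adds — write opens require CAP_SYS_RESOURCE, reads stay unprivileged — consider the following sketch (the flag values and capability check are stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define FMODE_READ_FLAG  0x1
#define FMODE_WRITE_FLAG 0x2
#define EPERM_ERR        (-1)

/* Stand-in for capable(CAP_SYS_RESOURCE); in the kernel this queries
 * the opening task's credentials. */
static bool has_cap_sys_resource;

/* Mirrors the shape of the psi_open() helper in the patch: reject
 * write opens from tasks lacking CAP_SYS_RESOURCE, allow all reads. */
static int mock_psi_open(unsigned int f_mode)
{
	if ((f_mode & FMODE_WRITE_FLAG) && !has_cap_sys_resource)
		return EPERM_ERR;
	return 0; /* the real helper would go on to single_open() */
}
```

The 0666 file mode then lets any user reach open(), while the capability check decides whether a write open is honored.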



Re: [PATCH v9 22/22] uvc: use vb2 ioctl and fop helpers

2021-04-01 Thread Tomasz Figa
Hi Ricardo,

On Fri, Mar 26, 2021 at 7:00 PM Ricardo Ribalda  wrote:
>
> From: Hans Verkuil 
>
> When uvc was written the vb2 ioctl and file operation helpers didn't exist.
>
> This patch switches uvc over to those helpers, which removes a lot of
> boilerplate code and simplifies VIDIOC_G/S_PRIORITY handling and allows
> us to drop the 'privileges' scheme, since that's now handled inside the
> vb2 helpers.
>
> This makes it possible for uvc to pass the v4l2-compliance streaming tests.
>
> Signed-off-by: Hans Verkuil 

Thanks for the patch. Did you perhaps miss adding your sign-off?

Also, see my comments inline.

[snip]
> @@ -1166,11 +969,6 @@ static int uvc_ioctl_s_parm(struct file *file, void *fh,
>  {
> struct uvc_fh *handle = fh;
> struct uvc_streaming *stream = handle->stream;
> -   int ret;
> -
> -   ret = uvc_acquire_privileges(handle);
> -   if (ret < 0)
> -   return ret;

Why is it okay not to acquire the privileges here?

>
> return uvc_v4l2_set_streamparm(stream, parm);
>  }

Best regards,
Tomasz


[PATCH] crypto:hisilicon/sec - fixup checking the 3DES weak key

2021-04-01 Thread Kai Ye
skcipher: Add a check to verify whether the triple DES key
is weak.

Signed-off-by: Kai Ye 
---
 drivers/crypto/hisilicon/sec2/sec_crypto.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/hisilicon/sec2/sec_crypto.c b/drivers/crypto/hisilicon/sec2/sec_crypto.c
index 2eaa516..ee18c88 100644
--- a/drivers/crypto/hisilicon/sec2/sec_crypto.c
+++ b/drivers/crypto/hisilicon/sec2/sec_crypto.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -573,10 +574,18 @@ static void sec_skcipher_uninit(struct crypto_skcipher *tfm)
sec_ctx_base_uninit(ctx);
 }
 
-static int sec_skcipher_3des_setkey(struct sec_cipher_ctx *c_ctx,
+static int sec_skcipher_3des_setkey(struct crypto_skcipher *tfm, const u8 *key,
const u32 keylen,
const enum sec_cmode c_mode)
 {
+   struct sec_ctx *ctx = crypto_skcipher_ctx(tfm);
+   struct sec_cipher_ctx *c_ctx = &ctx->c_ctx;
+   int ret;
+
+   ret = verify_skcipher_des3_key(tfm, key);
+   if (ret)
+   return ret;
+
switch (keylen) {
case SEC_DES3_2KEY_SIZE:
c_ctx->c_key_len = SEC_CKEY_3DES_2KEY;
@@ -648,7 +657,7 @@ static int sec_skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
 
switch (c_alg) {
case SEC_CALG_3DES:
-   ret = sec_skcipher_3des_setkey(c_ctx, keylen, c_mode);
+   ret = sec_skcipher_3des_setkey(tfm, key, keylen, c_mode);
break;
case SEC_CALG_AES:
case SEC_CALG_SM4:
-- 
2.8.1
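For context, the 3DES weak-key check that verify_skcipher_des3_key() performs boils down to rejecting key bundles where K1 == K2 or K2 == K3, since those degenerate toward single or double DES. A simplified user-space sketch (it omits the FIPS-mode and tfm-flag handling the kernel also applies):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DES_KEY_SIZE 8

/* Sketch of the 3DES keying-option check: a three-key bundle
 * K1|K2|K3 loses its intended strength when K1 == K2 or K2 == K3,
 * so such keys are rejected as weak. */
static bool des3_key_is_weak(const unsigned char *key)
{
	return memcmp(key, key + DES_KEY_SIZE, DES_KEY_SIZE) == 0 ||
	       memcmp(key + DES_KEY_SIZE, key + 2 * DES_KEY_SIZE,
		      DES_KEY_SIZE) == 0;
}
```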



Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-01 Thread Andi Kleen
> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allow-listing patches are already in the repository,
but not currently posted or complete:

https://github.com/intel/tdx/commits/guest

> 
> What are we talking about specifically?  Which drivers?  How many
> approximately?  Just virtio?  

Right now just virtio, later other drivers that hypervisors need.

> Are there any "real" hardware drivers
> involved like how QEMU emulates an e1000 or rtl8139 device?  

Not currently (but some later hypervisor might rely on one of those).

> What about
> the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> 
> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> 
> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says, the concern about MMIO is less about drivers (which should
be generally ok if they work on other architectures that require MMIO
magic) and more about other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really
isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers into some
complicated paravirt construct, blowing up code size everywhere and
adding complicated self-modifying code, when it's only needed for very
few drivers. But we also don't want to patch every MMIO access to be
special cased even in those few drivers.

#VE-based MMIO avoids all that cleanly while being nicely non-intrusive.

-Andi
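As a heavily simplified illustration of the trap-and-emulate idea behind #VE-based MMIO — one central handler services every driver's accesses instead of patching each access site — here is a toy model (all structure and function names are invented; a real TDX guest would forward the access to the host via a TDCALL):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: an MMIO read that would normally be a plain load traps
 * to a single #VE-style handler, which forwards it to the hypervisor
 * instead of requiring paravirt patching in every driver. */

struct toy_ve_info {
	uint64_t gpa;  /* guest-physical address that faulted */
	int size;      /* access width in bytes */
};

/* Stand-in for the hypercall to the host: here the "device" register
 * at 0x1000 reads back a fixed identifier. */
static uint64_t toy_hypervisor_mmio_read(uint64_t gpa, int size)
{
	(void)size;
	return gpa == 0x1000 ? 0xabcd : 0;
}

/* The central handler: one place handles every driver's MMIO reads. */
static uint64_t toy_ve_handle_read(const struct toy_ve_info *ve)
{
	return toy_hypervisor_mmio_read(ve->gpa, ve->size);
}
```

The point of the sketch is the shape: driver code stays unmodified, and only the one handler knows it is running in a TD.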



Re: Race condition in Kernel

2021-04-01 Thread Ming Lei
On Thu, Apr 01, 2021 at 04:27:37PM +, Gulam Mohamed wrote:
> Hi Ming,
> 
>   Thanks for taking a look into this. Can you please see my inline 
> comments in below mail?
> 
> Regards,
> Gulam Mohamed.
> 
> -Original Message-
> From: Ming Lei  
> Sent: Thursday, March 25, 2021 7:16 AM
> To: Gulam Mohamed 
> Cc: h...@infradead.org; linux-kernel@vger.kernel.org; 
> linux-bl...@vger.kernel.org; Junxiao Bi ; Martin 
> Petersen ; ax...@kernel.dk
> Subject: Re: Race condition in Kernel
> 
> On Wed, Mar 24, 2021 at 12:37:03PM +, Gulam Mohamed wrote:
> > Hi All,
> > 
> > We are facing a stale link (of the device) issue during the iscsi-logout 
> > process if we use parted command just before the iscsi logout. Here are the 
> > details:
> >  
> > As part of iscsi logout, the partitions and the disk will be removed. The 
> > parted command, used to list the partitions, will open the disk in RW mode 
> > which results in systemd-udevd re-reading the partitions. This will trigger 
> > the rescan partitions which will also delete and re-add the partitions. So, 
> > both iscsi logout processing and the parted (through systemd-udevd) will be 
> > involved in add/delete of partitions. In our case, the following sequence 
> > of operations happened (the iscsi device is /dev/sdb with partition sdb1):
> > 
> > 1. sdb1 was removed by PARTED
> > 2. kworker, as part of iscsi logout, couldn't remove sdb1 as it was 
> > already removed by PARTED
> > 3. sdb1 was added by parted
> 
> After kworker is started for logout, I guess all IOs are supposed to be 
> failed at that time, so just wondering why 'sdb1' is still added by 
> parted(systemd-udev)? 
> ioctl(BLKRRPART) needs to read partition table for adding back partitions, if 
> IOs are failed by iscsi logout, I guess the issue can be avoided too?
> 
> [GULAM]: Yes, the ioctl(BLKRRPART) reads the partition table for adding back 
> the partitions. I kept a printk in the code just after the partition table is 
> read. Noticed that the partition table was read before the iscsi-logout 
> kworker started the logout processing.

OK, I guess I understood your issue now: what you want is to disallow
adding partitions from step 1 onwards. So can you remove the disk just
at the beginning of step 2, if that is possible? Then step 1 isn't
needed any more.

For your issue, my patch of 'not drop partitions if partition table
isn't changed' can't fix it completely, since a new real partition may
still come from parted during this sequence.


Thanks,
Ming



Re: [PATCH v6 1/5] dt-bindings:drm/bridge:anx7625:add vendor define flags

2021-04-01 Thread Xin Ji
On Thu, Apr 01, 2021 at 02:33:47PM +0200, Robert Foss wrote:
> Hey Xin,
> 
> This series no longer applies to drm-misc/drm-misc-next, please rebase it.
Hi Robert Foss, OK, I'll rebase it on drm-misc-next after confirming the
HDCP patch with Sean Paul.
Thanks,
Xin
> 
> On Wed, 24 Mar 2021 at 08:52, Xin Ji  wrote:
> >
> > On Sun, Mar 21, 2021 at 02:00:38PM +0200, Laurent Pinchart wrote:
> > > Hi Xin,
> > >
> > > Thank you for the patch.
> > >
> > > On Fri, Mar 19, 2021 at 02:32:39PM +0800, Xin Ji wrote:
> > > > Add 'bus-type' and 'data-lanes' define for port0. Define DP tx lane0,
> > > > lane1 swing register array define, and audio enable flag.
> > > >
> > > > Signed-off-by: Xin Ji 
> > > > ---
> > > >  .../display/bridge/analogix,anx7625.yaml  | 58 ++-
> > > >  1 file changed, 57 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git 
> > > > a/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
> > > >  
> > > > b/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
> > > > index c789784efe30..3f54d5876982 100644
> > > > --- 
> > > > a/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
> > > > +++ 
> > > > b/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
> > > > @@ -34,6 +34,26 @@ properties:
> > > >  description: used for reset chip control, RESET_N pin B7.
> > > >  maxItems: 1
> > > >
> > > > +  analogix,lane0-swing:
> > > > +$ref: /schemas/types.yaml#/definitions/uint32-array
> > > > +minItems: 1
> > > > +maxItems: 20
> > > > +description:
> > > > +  an array of swing register setting for DP tx lane0 PHY, please 
> > > > don't
> > > > +  add this property, or contact vendor.
> > >
> > > DT properties need to be documented. Contacting the vendor doesn't count
> > > as documentation I'm afraid.
> >
> > Hi Laurent Pinchart, thanks for your comment. For the DP phy swing
> > setting, it is hard to describe here; it requires referring to the
> > anx7625 datasheet and programming guide. Basically, there is no need
> > to change the DP phy swing setting.
> >
> 
> Laurent is right. But if the value is practically a constant, you can
> move the swing register values into the driver. It should still be
> documented as well as possible, but we can be a little bit more flexible.
> 
> > > > @@ -73,6 +123,10 @@ examples:
> > > >  enable-gpios = < 45 GPIO_ACTIVE_HIGH>;
> > > >  reset-gpios = < 73 GPIO_ACTIVE_HIGH>;
> > > >
> > > > +analogix,audio-enable;
> > > > +analogix,lane0-swing = <0x14 0x54 0x64 0x74 0x29 0x7b 0x77 
> > > > 0x5b>;
> > > > +analogix,lane1-swing = <0x14 0x54 0x64 0x74 0x29 0x7b 0x77 
> > > > 0x5b>;
> > > > +
> > > >  ports {
> > > >  #address-cells = <1>;
> > > >  #size-cells = <0>;


Re: [PATCH v6 4/5] drm/bridge: anx7625: add HDCP support

2021-04-01 Thread Xin Ji
On Mon, Mar 29, 2021 at 02:02:08PM -0400, Sean Paul wrote:
> On Mon, Mar 29, 2021 at 6:27 AM Xin Ji  wrote:
> >
> > On Thu, Mar 25, 2021 at 02:19:23PM -0400, Sean Paul wrote:
> > > On Fri, Mar 19, 2021 at 2:35 AM Xin Ji  wrote:
> > > >
> > > > Add HDCP feature, enable HDCP function through chip internal key
> > > > and downstream's capability.
> > > >
> > > > Signed-off-by: Xin Ji 
> > > > ---
> 
> /snip
> 
> > > >  static void anx7625_dp_start(struct anx7625_data *ctx)
> > > >  {
> > > > int ret;
> > > > @@ -643,6 +787,9 @@ static void anx7625_dp_start(struct anx7625_data *ctx)
> > > > return;
> > > > }
> > > >
> > > > +   /* HDCP config */
> > > > +   anx7625_hdcp_setting(ctx);
> > >
> > > You should really use the "Content Protection" property to
> > > enable/disable HDCP instead of force-enabling it at all times.
> > >
> > > Sean
> > Hi Sean, it's hard to implement the "Content Protection" property: we
> > have implemented HDCP in firmware, and it is not compatible with that
> > property. We don't have an interface to get the downstream cert.
> > Thanks,
> > Xin
> 
> Hi Xin,
> I'm sorry, I don't understand what you mean when you say you don't
> have an interface to get Downstream Cert.
> 
> The Content Protection property is just a means through which
> userspace can turn on and turn off HDCP when it needs. As far as I can
> tell, your patch turns on HDCP when the display is enabled and leaves
> it on until it is disabled. This is undesirable since it forces HDCP
> on the user.
> 
> Is it impossible to enable/disable HDCP outside of display
> enable/disable on your hardware?
> 
> Thanks,
> 
> Sean
Hi Sean, I have committed a test patch on the Google review site; can you
please help to review it? I'll use the connector's ".atomic_check()"
interface to detect Content Protection property changes.
(https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2674580)
Thanks,
Xin
> 
> > >
> > > > +
> > > > if (ctx->pdata.is_dpi)
> > > > ret = anx7625_dpi_config(ctx);
> > > > else
> 
> /snip


Re: [PATCH printk v2 2/5] printk: remove safe buffers

2021-04-01 Thread Sergey Senozhatsky
On (21/04/01 16:17), Petr Mladek wrote:
> > For the long term, we should introduce a printk-context API that allows
> > callers to perfectly pack their multi-line output into a single
> > entry. We discussed [0][1] this back in August 2020.
> 
> We need a "short" term solution. There are currently 3 solutions:
> 
> 1. Keep nmi_safe() and all the hacks around.
> 
> 2. Serialize nmi_cpu_backtrace() by a spin lock and later by
>the special lock used also by atomic consoles.
> 
> 3. Tell complaining people how to sort the messed logs.

Are we talking about nmi_cpu_backtrace()->dump_stack() or some
other path?

dump_stack() seems to be already serialized by `dump_lock`. Hmm,
show_regs() is not serialized, seems like it should be under the
same `dump_lock` as dump_stack().


Re: [PATCH printk v2 1/5] printk: track/limit recursion

2021-04-01 Thread Sergey Senozhatsky
On (21/04/01 12:00), Petr Mladek wrote:
> On Tue 2021-03-30 17:35:08, John Ogness wrote:
> > Currently the printk safe buffers provide a form of recursion
> > protection by redirecting to the safe buffers whenever printk() is
> > recursively called.
> > 
> > In preparation for removal of the safe buffers, provide an alternate
> > explicit recursion protection. Recursion is limited to 3 levels
> > per-CPU and per-context.
> > 
> > Signed-off-by: John Ogness 
> 
> Reviewed-by: Petr Mladek 

Reviewed-by: Sergey Senozhatsky 
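A single-threaded sketch of the recursion limit described in the commit message (the real implementation keeps one counter per CPU and per context — task/softirq/irq/NMI — while this collapses it to a single counter for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define PRINTK_MAX_RECURSION 3

/* Each printk entry bumps a counter and bails out once the limit is
 * hit; exiting decrements it.  A message arriving at the fourth
 * nesting level is simply dropped. */
static int recursion_count;

static bool printk_enter(void)
{
	if (recursion_count >= PRINTK_MAX_RECURSION)
		return false; /* drop the message */
	recursion_count++;
	return true;
}

static void printk_exit(void)
{
	recursion_count--;
}
```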


Re: [PATCH v2 3/3] bindings: ipmi: Add binding for Aspeed SSIF BMC driver

2021-04-01 Thread Quan Nguyen

On 02/04/2021 00:09, Rob Herring wrote:

On Tue, Mar 30, 2021 at 09:10:29PM +0700, Quan Nguyen wrote:

Add device tree binding document for the Aspeed SSIF BMC driver.

Signed-off-by: Quan Nguyen 
---
  .../bindings/ipmi/aspeed-ssif-bmc.txt  | 18 ++
  1 file changed, 18 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/ipmi/aspeed-ssif-bmc.txt


Same comment as you ignored on v1.



Dear Rob,
I really did not mean to do that.

What happened is that there was a compilation error reported by the
kernel test robot on my v1. So I tried to send my v2 to fix that issue
asap. Unfortunately, your reply on my v1 arrived just after I hit "git
send-email" to send out my v2.


For this comment, I'll switch to using a yaml file in the next version.

- Quan


Re: [PATCH v1 3/3] KEYS: trusted: Introduce support for NXP CAAM-based trusted keys

2021-04-01 Thread Serge E. Hallyn
On Wed, Mar 24, 2021 at 09:14:02AM -0700, James Bottomley wrote:
> On Tue, 2021-03-23 at 14:07 -0400, Mimi Zohar wrote:
> > On Tue, 2021-03-23 at 17:35 +0100, Ahmad Fatoum wrote:
> > > Hello Horia,
> > > 
> > > On 21.03.21 21:48, Horia Geantă wrote:
> > > > On 3/16/2021 7:02 PM, Ahmad Fatoum wrote:
> > > > [...]
> > > > > +struct trusted_key_ops caam_trusted_key_ops = {
> > > > > + .migratable = 0, /* non-migratable */
> > > > > + .init = trusted_caam_init,
> > > > > + .seal = trusted_caam_seal,
> > > > > + .unseal = trusted_caam_unseal,
> > > > > + .exit = trusted_caam_exit,
> > > > > +};
> > > > caam has random number generation capabilities, so it's worth
> > > > using that
> > > > by implementing .get_random.
> > > 
> > > If the CAAM HWRNG is already seeding the kernel RNG, why not use
> > > the kernel's?
> > > 
> > > Makes for less code duplication IMO.
> > 
> > Using kernel RNG, in general, for trusted keys has been discussed
> > before.   Please refer to Dave Safford's detailed explanation for not
> > using it [1].
> > 
> > thanks,
> > 
> > Mimi
> > 
> > [1] 
> > https://lore.kernel.org/linux-integrity/bca04d5d9a3b764c9b7405bba4d4a3c035f2a...@alpmbapa12.e2k.ad.ge.com/
> 
> I still don't think relying on one source of randomness to be
> cryptographically secure is a good idea.  The fear of bugs in the
> kernel entropy pool is reasonable, but since it's widely used they're
> unlikely to persist very long.

I'm not sure I agree - remember
https://www.schneier.com/blog/archives/2008/05/random_number_b.html ?  You'd
surely expect that to have been found quickly.

>   Studies have shown that some TPMs
> (notably the chinese manufactured ones) have suspicious failures in
> their RNGs:
> 
> https://www.researchgate.net/publication/45934562_Benchmarking_the_True_Random_Number_Generator_of_TPM_Chips
> 
> And most cryptographers recommend using a TPM for entropy mixing rather
> than directly:
> 
> https://blog.cryptographyengineering.com/category/rngs/
> 
> The TPMFail paper also shows that in spite of NIST certification
> things can go wrong with a TPM:
> 
> https://tpm.fail/

In this thread I've seen argument over "which is better" and "which is
user api", but no one has mentioned FIPS.  Unfortunately, so long as the
kernel RNG refuses to be FIPS-friendly (cf
https://lkml.org/lkml/2020/9/21/157), making CAAM-based trusted keys
depend on the kernel RNG would make them impossible to use in
FIPS-certified applications without a forked kernel.

So I definitely am in favor of a config or kernel command line option to drive
which rng to use.


[PATCH] KVM: SVM: Add support for KVM_SEV_SEND_CANCEL command

2021-04-01 Thread Steve Rutherford
After completion of SEND_START, but before SEND_FINISH, the source VMM can
issue the SEND_CANCEL command to stop a migration. This is necessary so
that a cancelled migration can restart with a new target later.

Signed-off-by: Steve Rutherford 
---
 .../virt/kvm/amd-memory-encryption.rst|  9 +++
 arch/x86/kvm/svm/sev.c| 24 +++
 include/linux/psp-sev.h   | 10 
 include/uapi/linux/kvm.h  |  2 ++
 4 files changed, 45 insertions(+)

diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst
index 469a6308765b1..9e018a3eec03b 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/amd-memory-encryption.rst
@@ -284,6 +284,15 @@ Returns: 0 on success, -negative on error
 __u32 len;
 };
 
+16. KVM_SEV_SEND_CANCEL
+
+
+After completion of SEND_START, but before SEND_FINISH, the source VMM can 
issue the
+SEND_CANCEL command to stop a migration. This is necessary so that a cancelled
+migration can restart with a new target later.
+
+Returns: 0 on success, -negative on error
+
 References
 ==
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 83e00e5245136..88e72102cb900 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1110,6 +1110,27 @@ static int sev_get_attestation_report(struct kvm *kvm, struct kvm_sev_cmd *argp)
return ret;
 }
 
+static int sev_send_cancel(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+   struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+   struct sev_data_send_cancel *data;
+   int ret;
+
+   if (!sev_guest(kvm))
+   return -ENOTTY;
+
+   data = kzalloc(sizeof(*data), GFP_KERNEL);
+   if (!data)
+   return -ENOMEM;
+
+   data->handle = sev->handle;
+   ret = sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, data, &argp->error);
+
+   kfree(data);
+   return ret;
+}
+
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
struct kvm_sev_cmd sev_cmd;
@@ -1163,6 +1184,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
case KVM_SEV_GET_ATTESTATION_REPORT:
r = sev_get_attestation_report(kvm, &sev_cmd);
break;
+   case KVM_SEV_SEND_CANCEL:
+   r = sev_send_cancel(kvm, &sev_cmd);
+   break;
default:
r = -EINVAL;
goto out;
diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h
index b801ead1e2bb5..74f2babffc574 100644
--- a/include/linux/psp-sev.h
+++ b/include/linux/psp-sev.h
@@ -73,6 +73,7 @@ enum sev_cmd {
SEV_CMD_SEND_UPDATE_DATA= 0x041,
SEV_CMD_SEND_UPDATE_VMSA= 0x042,
SEV_CMD_SEND_FINISH = 0x043,
+   SEV_CMD_SEND_CANCEL = 0x044,
 
/* Guest migration commands (incoming) */
SEV_CMD_RECEIVE_START   = 0x050,
@@ -392,6 +393,15 @@ struct sev_data_send_finish {
u32 handle; /* In */
 } __packed;
 
+/**
+ * struct sev_data_send_cancel - SEND_CANCEL command parameters
+ *
+ * @handle: handle of the VM to process
+ */
+struct sev_data_send_cancel {
+   u32 handle; /* In */
+} __packed;
+
 /**
  * struct sev_data_receive_start - RECEIVE_START command parameters
  *
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620d..707469b6b7072 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1671,6 +1671,8 @@ enum sev_cmd_id {
KVM_SEV_CERT_EXPORT,
/* Attestation report */
KVM_SEV_GET_ATTESTATION_REPORT,
+   /* Guest Migration Extension */
+   KVM_SEV_SEND_CANCEL,
 
KVM_SEV_NR_MAX,
 };
-- 
2.31.0.208.g409f899ff0-goog
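The shape of the new command path — allocate a firmware command buffer, fill in the guest handle, issue the command, free the buffer — can be mocked in user space as follows (the command id, error values, and all names are illustrative, not the real KVM/SEV ABI):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy dispatch mirroring the shape of svm_mem_enc_op(): user space
 * passes a command id, and the kernel routes it to a handler that
 * talks to the "firmware". */
enum { TOY_SEV_SEND_CANCEL = 1 };

struct toy_send_cancel {
	uint32_t handle;
};

static uint32_t issued_handle; /* records what reached the "firmware" */

static int toy_issue_cmd(int cmd, void *data)
{
	if (cmd != TOY_SEV_SEND_CANCEL)
		return -22; /* stand-in for -EINVAL */
	issued_handle = ((struct toy_send_cancel *)data)->handle;
	return 0;
}

/* Mirrors sev_send_cancel(): build the parameter block, issue, free. */
static int toy_send_cancel(uint32_t vm_handle)
{
	struct toy_send_cancel *data = calloc(1, sizeof(*data));
	int ret;

	if (!data)
		return -12; /* stand-in for -ENOMEM */
	data->handle = vm_handle;
	ret = toy_issue_cmd(TOY_SEV_SEND_CANCEL, data);
	free(data);
	return ret;
}
```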



Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl

2021-04-01 Thread Steve Rutherford
On Fri, Mar 19, 2021 at 11:00 AM Ashish Kalra  wrote:
>
> On Thu, Mar 11, 2021 at 12:48:07PM -0800, Steve Rutherford wrote:
> > On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra  wrote:
> > >
> > > On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote:
> > > > [+Marc]
> > > >
> > > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote:
> > > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote:
> > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote:
> > > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote:
> > > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra 
> > > > > > > >  wrote:
> > > > > > > > Thanks for grabbing the data!
> > > > > > > >
> > > > > > > > I am fine with both paths. Sean has stated an explicit desire 
> > > > > > > > for
> > > > > > > > hypercall exiting, so I think that would be the current 
> > > > > > > > consensus.
> > > > > >
> > > > > > Yep, though it'd be good to get Paolo's input, too.
> > > > > >
> > > > > > > > If we want to do hypercall exiting, this should be in a 
> > > > > > > > follow-up
> > > > > > > > series where we implement something more generic, e.g. a 
> > > > > > > > hypercall
> > > > > > > > exiting bitmap or hypercall exit list. If we are taking the 
> > > > > > > > hypercall
> > > > > > > > exit route, we can drop the kvm side of the hypercall.
> > > > > >
> > > > > > I don't think this is a good candidate for arbitrary hypercall 
> > > > > > interception.  Or
> > > > > > rather, I think hypercall interception should be an orthogonal 
> > > > > > implementation.
> > > > > >
> > > > > > The guest, including guest firmware, needs to be aware that the 
> > > > > > hypercall is
> > > > > > supported, and the ABI needs to be well-defined.  Relying on 
> > > > > > userspace VMMs to
> > > > > > implement a common ABI is an unnecessary risk.
> > > > > >
> > > > > > We could make KVM's default behavior be a nop, i.e. have KVM 
> > > > > > enforce the ABI but
> > > > > > require further VMM intervention.  But, I just don't see the point, 
> > > > > > it would
> > > > > > save only a few lines of code.  It would also limit what KVM could 
> > > > > > do in the
> > > > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to 
> > > > > > userspace,
> > > > > > then mandatory interception would essentially make it impossible 
> > > > > > for KVM to do
> > > > > > bookkeeping while still honoring the interception request.
> > > > > >
> > > > > > However, I do think it would make sense to have the userspace exit 
> > > > > > be a generic
> > > > > > exit type.  But hey, we already have the necessary ABI defined for 
> > > > > > that!  It's
> > > > > > just not used anywhere.
> > > > > >
> > > > > >   /* KVM_EXIT_HYPERCALL */
> > > > > >   struct {
> > > > > >   __u64 nr;
> > > > > >   __u64 args[6];
> > > > > >   __u64 ret;
> > > > > >   __u32 longmode;
> > > > > >   __u32 pad;
> > > > > >   } hypercall;
> > > > > >
> > > > > >
> > > > > > > > Userspace could also handle the MSR using MSR filters (would 
> > > > > > > > need to
> > > > > > > > confirm that).  Then userspace could also be in control of the 
> > > > > > > > cpuid bit.
> > > > > >
> > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits 
> > > > > > of data.
> > > > > > The data limitation could be fudged by shoving data into 
> > > > > > non-standard GPRs, but
> > > > > > that will result in truly heinous guest code, and extensibility 
> > > > > > issues.
> > > > > >
>
> We may also need to pass-through the MSR to userspace, as it is a part of this
> complete host (userspace/kernel), OVMF and guest kernel negotiation of
> the SEV live migration feature.
>
> Host (userspace/kernel) advertises it's support for SEV live migration
> feature via the CPUID bits, which is queried by OVMF and which in turn
> adds a new UEFI runtime variable to indicate support for SEV live
> migration, which is later queried during guest kernel boot and
> accordingly the guest does a wrmrsl() to custom MSR to complete SEV
> live migration negotiation and enable it.
>
> Now, the GET_SHARED_REGION_LIST ioctl returns error, until this MSR write
> enables SEV live migration, hence, preventing userspace to start live
> migration before the feature support has been negotiated and enabled on
> all the three components - host, guest OVMF and kernel.
>
> But, now with this ioctl not existing anymore, we will need to
> pass-through the MSR to userspace too, for it to only initiate live
> migration once the feature negotiation has been completed.
Hey Ashish,

I can't tell if you were waiting for feedback on this before posting
the follow-up patch series.

Here are a few options:
1) Add the MSR explicitly to the list of custom kvm MSRs, but don't
have it hooked up anywhere. The expectation would be for the VMM to
use msr intercepts to handle the reads and writes. If that seems
weird, have svm_set_msr (or 

[PATCH -next v2] libbpf: remove redundant semi-colon

2021-04-01 Thread Yang Yingliang
Remove a redundant semicolon in finalize_btf_ext().

Signed-off-by: Yang Yingliang 
---
v2:
  add commit log
---
 tools/lib/bpf/linker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/linker.c b/tools/lib/bpf/linker.c
index 46b16cbdcda3..4e08bc07e635 100644
--- a/tools/lib/bpf/linker.c
+++ b/tools/lib/bpf/linker.c
@@ -1895,7 +1895,7 @@ static int finalize_btf_ext(struct bpf_linker *linker)
hdr->func_info_len = funcs_sz;
hdr->line_info_off = funcs_sz;
hdr->line_info_len = lines_sz;
-   hdr->core_relo_off = funcs_sz + lines_sz;;
+   hdr->core_relo_off = funcs_sz + lines_sz;
hdr->core_relo_len = core_relos_sz;
 
if (funcs_sz) {
-- 
2.25.1



Re: [PATCH v2] ext4: Fix ext4_error_err save negative errno into superblock

2021-04-01 Thread Andreas Dilger
On Apr 1, 2021, at 1:40 AM, Ye Bin  wrote:
> 
> As read_mmp_block return 1 when failed. read_mmp_block return -EIO when buffer
> isn't uptodate.

Thank you for this second patch.  Unfortunately, the commit message
is still confusing/incorrect because it references read_mmp_block()
in the first usage but is actually changing write_mmp_block().

With that change you could add a Reviewed-by label from me.

Cheers, Andreas

> Fixes: 54d3adbc29f0 ("ext4: save all error info in save_error_info() and
> drop ext4_set_errno()")
> Reported-by: Liu Zhi Qiang 
> Signed-off-by: Ye Bin 
> ---
> fs/ext4/mmp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 795c3ff2907c..68fbeedd627b 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -56,7 +56,7 @@ static int write_mmp_block(struct super_block *sb, struct buffer_head *bh)
>   wait_on_buffer(bh);
>   sb_end_write(sb);
>   if (unlikely(!buffer_uptodate(bh)))
> - return 1;
> + return -EIO;
> 
>   return 0;
> }
> --
> 2.25.4
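To see why the sign of the return value matters here: error-saving helpers expect kernel-style negative errno returns and record the positive value, so a bare `return 1` would be recorded as a nonsensical code. A minimal illustration (the recording helper is invented, not ext4's save_error_info()):

```c
#include <assert.h>
#include <errno.h>

/* Callers that record an error negate the return value to obtain a
 * positive errno; returning 1 instead of -EIO yields a bogus -1. */
static int recorded;

static void record_error(int ret)
{
	if (ret)
		recorded = -ret; /* expects ret to be a negative errno */
}
```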
> 




Re: [PATCH v3 4/4] dt-bindings: serial: 8250: add aspeed,lpc-address and aspeed,sirq

2021-04-01 Thread Zev Weiss

On Thu, Apr 01, 2021 at 08:14:39PM CDT, Andrew Jeffery wrote:



On Fri, 2 Apr 2021, at 11:17, Zev Weiss wrote:

These correspond to the existing lpc_address, sirq, and sirq_polarity
sysfs attributes; the second element of aspeed,sirq provides a
replacement for the deprecated aspeed,sirq-polarity-sense property.

Signed-off-by: Zev Weiss 
---
 .../devicetree/bindings/serial/8250.yaml  | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/serial/8250.yaml b/Documentation/devicetree/bindings/serial/8250.yaml
index 491b9297432d..a6e01f9b745f 100644
--- a/Documentation/devicetree/bindings/serial/8250.yaml
+++ b/Documentation/devicetree/bindings/serial/8250.yaml
@@ -12,8 +12,13 @@ maintainers:
 allOf:
   - $ref: /schemas/serial.yaml#
   - if:
-  required:
-- aspeed,sirq-polarity-sense
+  anyOf:
+- required:
+- aspeed,lpc-address


Why not aspeed,lpc-io-reg like the KCS binding?

There are some things we can do to improve it, but we shouldn't go and invent 
something different. I prefer aspeed,lpc-io-reg because its name derives from 
the generic 'reg' property, as does its behaviour (if you assume a related 
`#size-cells = 0`).


+- required:
+- aspeed,sirq


Why not aspeed,lpc-interrupts like the KCS binding?

The generic IRQ property is 'interrupts', so like aspeed,lpc-io-reg the 
interrupts proposal for KCS follows in name and form. I'm hiding it behind the 
aspeed vendor prefix for now while I'm not satisfied that I understand the 
requirements of non-aspeed parts. Similarly, I added the lpc prefix because we 
don't tend to describe the host devicetree in the BMC devicetree (and so 
there's no parent interrupt controller that we can reference via a phandle) and 
we need a way to differentiate from the local interrupts property.

I don't see a reason for either of them to differ from what we already have for 
KCS, and I don't see any reason to continue the sysfs naming scheme in the 
binding.



Ah, OK -- I was aiming for consistency with the existing vuart sysfs 
attributes, but if that's not a worthwhile concern I'm fine with 
aspeed,lpc-io-reg & aspeed,lpc-interrupts.



Zev



Re: [PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 02:58:13AM +0200, Frederic Weisbecker wrote:
> On Thu, Apr 01, 2021 at 05:48:56PM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 01:47:02AM +0200, Frederic Weisbecker wrote:
> > > When an ssp has already started a grace period and queued an early work
> > > to flush after SRCU workqueues are created, we expect the ssp to be
> > > properly initialized already. So we can skip this step at this stage.
> > > 
> > > Signed-off-by: Frederic Weisbecker 
> > > Cc: Boqun Feng 
> > > Cc: Lai Jiangshan 
> > > Cc: Neeraj Upadhyay 
> > > Cc: Josh Triplett 
> > > Cc: Joel Fernandes 
> > > Cc: Uladzislau Rezki 
> > > ---
> > >  kernel/rcu/srcutree.c | 1 -
> > >  1 file changed, 1 deletion(-)
> > > 
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 036ff5499ad5..7197156418e4 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -1396,7 +1396,6 @@ void __init srcu_init(void)
> > >   while (!list_empty(&srcu_boot_list)) {
> > >   ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> > > work.work.entry);
> > > - check_init_srcu_struct(ssp);
> > 
> > You lost me on this one.  What happens if the only pre-initialization
> > invocation on the statically allocated srcu_struct pointed to by ssp
> > was call_srcu()?  I am not seeing how the initialization has already
> > happened in that case.
> > 
> > What am I missing here?
> 
> call_srcu() -> __call_srcu() -> srcu_gp_start_if_needed() ->
> check_init_srcu_struct() ?
> 
> Or is it me missing something?

Nope, me getting confused between Tree SRCU's and Tiny SRCU's
call_srcu() implementation.  :-/

I have queued this patch and started testing.

Thanx, Paul


Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-01 Thread Aili Yao
On Thu, 1 Apr 2021 08:33:20 -0700
"Luck, Tony"  wrote:

> On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call on the
> > same page now returns 0, meaning OK. For nested memory MCE handling, this
> > behavior may lead to an MCE loop. Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different cores X && Y separately, which will access one same page, then
> > the page gets corrupted when process A accesses it, an MCE will be raised to
> > core X and the error processing is just underway.
> > 
> > 2.Then B accesses the page and triggers another MCE to core Y; it will also
> > do error processing, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257 sync_core();
> > 1258 return;
> > 1259 }
> > 
> > 1267 }
> 
> With your change memory_failure() will return -EHWPOISON for the
> second task that consumes poison ... so that "if" statement won't
> be true and so we fall into the following code:
> 
> 1273 if (p->mce_vaddr != (void __user *)-1l) {
> 1274 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, 
> PAGE_SHIFT);
> 1275 } else {
> 1276 pr_err("Memory error not recovered");
> 1277 kill_me_now(cb);
> 1278 }
> 
> If this was a copy_from_user() machine check, p->mce_vaddr is set and
> the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that
> 
>   "Memory error not recovered"
> 
> message and send a generic SIGBUS.  I don't think either of those options
> is right.
> 
> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> early, but first process is still looking for mappings to unmap and tasks
> to signal) this patch moves forward a bit. But I think it needs an
> additional change here in kill_me_maybe() to just "return" if there is an
> EHWPOISON return from memory_failure().
> 
Got this, Thanks for your reply!
I will dig into this!

-- 
Thanks!
Aili Yao


Re: [f2fs-dev] [PATCH] f2fs: set checkpoint_merge by default

2021-04-01 Thread Chao Yu

On 2021/4/2 8:42, Jaegeuk Kim wrote:

Once we introduced checkpoint_merge, we've seen some contention w/o the option.
In order to avoid it, let's set it by default.

Signed-off-by: Jaegeuk Kim 


Reviewed-by: Chao Yu 

Thanks,


Re: [PATCH v3 4/4] dt-bindings: serial: 8250: add aspeed,lpc-address and aspeed,sirq

2021-04-01 Thread Andrew Jeffery



On Fri, 2 Apr 2021, at 11:17, Zev Weiss wrote:
> These correspond to the existing lpc_address, sirq, and sirq_polarity
> sysfs attributes; the second element of aspeed,sirq provides a
> replacement for the deprecated aspeed,sirq-polarity-sense property.
> 
> Signed-off-by: Zev Weiss 
> ---
>  .../devicetree/bindings/serial/8250.yaml  | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/serial/8250.yaml 
> b/Documentation/devicetree/bindings/serial/8250.yaml
> index 491b9297432d..a6e01f9b745f 100644
> --- a/Documentation/devicetree/bindings/serial/8250.yaml
> +++ b/Documentation/devicetree/bindings/serial/8250.yaml
> @@ -12,8 +12,13 @@ maintainers:
>  allOf:
>- $ref: /schemas/serial.yaml#
>- if:
> -  required:
> -- aspeed,sirq-polarity-sense
> +  anyOf:
> +- required:
> +- aspeed,lpc-address

Why not aspeed,lpc-io-reg like the KCS binding?

There are some things we can do to improve it, but we shouldn't go and invent 
something different. I prefer aspeed,lpc-io-reg because its name derives from 
the generic 'reg' property, as does its behaviour (if you assume a related 
`#size-cells = <0>`).

> +- required:
> +- aspeed,sirq

Why not aspeed,lpc-interrupts like the KCS binding?

The generic IRQ property is 'interrupts', so like aspeed,lpc-io-reg the 
interrupts proposal for KCS follows in name and form. I'm hiding it behind the 
aspeed vendor prefix for now while I'm not satisfied that I understand the 
requirements of non-aspeed parts. Similarly, I added the lpc prefix because we 
don't tend to describe the host devicetree in the BMC devicetree (and so 
there's no parent interrupt controller that we can reference via a phandle) and 
we need a way to differentiate from the local interrupts property.

I don't see a reason for either of them to differ from what we already have for 
KCS, and I don't see any reason to continue the sysfs naming scheme in the 
binding.

Eventually I want to distil an LPC peripheral binding schema from what we've 
developed for the peripherals supported by aspeed and nuvoton SoCs.

Cheers,

Andrew

> +- required:
> +- aspeed,sirq-polarity-sense
>  then:
>properties:
>  compatible:
> @@ -190,6 +195,20 @@ properties:
>applicable to aspeed,ast2500-vuart.
>  deprecated: true
>  
> +  aspeed,lpc-address:
> +$ref: '/schemas/types.yaml#/definitions/uint32'
> +description: |
> +  The VUART LPC address.  Only applicable to aspeed,ast2500-vuart.
> +
> +  aspeed,sirq:
> +$ref: "/schemas/types.yaml#/definitions/uint32-array"
> +minItems: 2
> +maxItems: 2
> +description: |
> +  A 2-cell property describing the VUART SIRQ number and SIRQ
> +  polarity (IRQ_TYPE_LEVEL_LOW or IRQ_TYPE_LEVEL_HIGH).  Only
> +  applicable to aspeed,ast2500-vuart.
> +
>  required:
>- reg
>- interrupts
> @@ -221,6 +240,7 @@ examples:
>  };
>- |
>  #include <dt-bindings/clock/aspeed-clock.h>
> +#include <dt-bindings/interrupt-controller/irq.h>
>  serial@1e787000 {
>  compatible = "aspeed,ast2500-vuart";
>  reg = <0x1e787000 0x40>;
> @@ -228,7 +248,8 @@ examples:
>  interrupts = <8>;
>  clocks = <&syscon ASPEED_CLK_APB>;
>  no-loopback-test;
> -aspeed,sirq-polarity-sense = <&syscon 0x70 25>;
> +aspeed,lpc-address = <0x3f8>;
> +aspeed,sirq = <4 IRQ_TYPE_LEVEL_LOW>;
>  };
>  
>  ...
> -- 
> 2.31.1
> 
>


Re: [PATCH 3/3] srcu: Fix broken node geometry after early ssp init

2021-04-01 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 01:47:04AM +0200, Frederic Weisbecker wrote:
> An ssp initialized before rcu_init_geometry() will have its snp hierarchy
> based on CONFIG_NR_CPUS.
> 
> Once rcu_init_geometry() is called, the nodes distribution is shrinked
> and optimized toward meeting the actual possible number of CPUs detected
> on boot.
> 
> Later on, the ssp that was initialized before rcu_init_geometry() is
> confused and sometimes refers to its initial CONFIG_NR_CPUS based node
> hierarchy, sometimes to the new num_possible_cpus() based one instead.
> For example each of its sdp->mynode remain backward and refer to the
> early node leaves that may not exist anymore. On the other hand the
> srcu_for_each_node_breadth_first() refers to the new node hierarchy.
> 
> There are at least two bad possible outcomes to this:
> 
> 1) a) A callback enqueued early on an sdp is recorded pending on
>   sdp->mynode->srcu_data_have_cbs in srcu_funnel_gp_start() with
>   sdp->mynode pointing to a deep leaf (say 3 levels).
> 
>b) The grace period ends after rcu_init_geometry() which shrinks the
>   nodes level to a single one. srcu_gp_end() walks through the new
>   snp hierarchy without ever reaching the old leaves so the callback
>   is never executed.
> 
>This is easily reproduced on an 8 CPUs machine with
>CONFIG_NR_CPUS >= 32 and "rcupdate.rcu_self_test=1". The
>srcu_barrier() after early tests verification never completes and
>the boot hangs:
> 
>   [ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
>   [ 5413.147564]   Not tainted 5.12.0-rc4+ #28
>   [ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 5413.159753] task:swapper/0   state:D stack:0 pid:1 ppid: 0 flags:0x4000
>   [ 5413.168099] Call Trace:
>   [ 5413.170555]  __schedule+0x36c/0x930
>   [ 5413.174057]  ? wait_for_completion+0x88/0x110
>   [ 5413.178423]  schedule+0x46/0xf0
>   [ 5413.181575]  schedule_timeout+0x284/0x380
>   [ 5413.185591]  ? wait_for_completion+0x88/0x110
>   [ 5413.189957]  ? mark_held_locks+0x61/0x80
>   [ 5413.193882]  ? mark_held_locks+0x61/0x80
>   [ 5413.197809]  ? _raw_spin_unlock_irq+0x24/0x50
>   [ 5413.202173]  ? wait_for_completion+0x88/0x110
>   [ 5413.206535]  wait_for_completion+0xb4/0x110
>   [ 5413.210724]  ? srcu_torture_stats_print+0x110/0x110
>   [ 5413.215610]  srcu_barrier+0x187/0x200
>   [ 5413.219277]  ? rcu_tasks_verify_self_tests+0x50/0x50
>   [ 5413.224244]  ? rdinit_setup+0x2b/0x2b
>   [ 5413.227907]  rcu_verify_early_boot_tests+0x2d/0x40
>   [ 5413.232700]  do_one_initcall+0x63/0x310
>   [ 5413.236541]  ? rdinit_setup+0x2b/0x2b
>   [ 5413.240207]  ? rcu_read_lock_sched_held+0x52/0x80
>   [ 5413.244912]  kernel_init_freeable+0x253/0x28f
>   [ 5413.249273]  ? rest_init+0x250/0x250
>   [ 5413.252846]  kernel_init+0xa/0x110
>   [ 5413.256257]  ret_from_fork+0x22/0x30
> 
> 2) An ssp that gets initialized before rcu_init_geometry() and used
>afterward will always have stale rdp->mynode references, resulting in
>callbacks to be missed in srcu_gp_end(), just like in the previous
>scenario.
> 
> Solve this with fixing the node tree layout of early initialized ssp
> once rcu_init_geometry() is done. Unfortunately this involves a new
> field into struct srcu_struct to postpone the ssp hierarchy rebuild.
> 
> Signed-off-by: Frederic Weisbecker 
> Cc: Boqun Feng 
> Cc: Lai Jiangshan 
> Cc: Neeraj Upadhyay 
> Cc: Josh Triplett 
> Cc: Joel Fernandes 
> Cc: Uladzislau Rezki 

Again, good catch and thank you!

One question below.

Thanx, Paul

> ---
>  include/linux/srcutree.h |  1 +
>  kernel/rcu/srcutree.c| 68 +++-
>  2 files changed, 62 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 9cfcc8a756ae..4339e5794a72 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -85,6 +85,7 @@ struct srcu_struct {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>   struct lockdep_map dep_map;
>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> + struct list_head early_init;
>  };
>  
>  /* Values for state variable (bottom bits of ->srcu_gp_seq). */
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 10e681ea7051..285f0c053754 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -39,7 +39,7 @@ static ulong counter_wrap_check = (ULONG_MAX >> 2);
>  module_param(counter_wrap_check, ulong, 0444);
>  
>  /* Early-boot callback-management, so early that no lock is required! */
> -static LIST_HEAD(srcu_boot_list);
> +static LIST_HEAD(srcu_boot_queue_list);
>  static bool __read_mostly srcu_init_done;
>  
>  static void srcu_invoke_callbacks(struct work_struct *work);
> @@ -133,7 +133,9 @@ 

Re: [PATCH v2] firmware: qcom_scm: Only compile legacy calls on ARM

2021-04-01 Thread Elliot Berman




On 3/23/2021 3:43 PM, Stephen Boyd wrote:

These scm calls are never used outside of legacy ARMv7 based platforms.
That's because PSCI, mandated on arm64, implements them for modern SoCs
via the PSCI spec. Let's move them to the legacy file and only compile
the legacy file into the kernel when CONFIG_ARM=y. Otherwise provide
stubs and fail the calls. This saves a little bit of space in an
arm64 allmodconfig.

  $ ./scripts/bloat-o-meter vmlinux.before vmlinux.after
  add/remove: 0/8 grow/shrink: 5/6 up/down: 509/-4401 (-3892)
  Function old new   delta
  __qcom_scm_set_dload_mode.constprop  312 452+140
  qcom_scm_qsmmu500_wait_safe_toggle   288 416+128
  qcom_scm_io_writel   288 408+120
  qcom_scm_io_readl376 492+116
  __param_str_download_mode 23  28  +5
  __warned43274326  -1
  e843419@0b3f_00010432_324  8   -  -8
  qcom_scm_call228 208 -20
  CSWTCH  59255877 -48
  _sub_I_65535_1163100  163040 -60
  _sub_D_65535_0163100  163040 -60
  qcom_scm_wb   64   - -64
  qcom_scm_lock320 160-160
  qcom_scm_call_atomic 212   --212
  qcom_scm_cpu_power_down  308   --308
  scm_legacy_call_atomic   520   --520
  qcom_scm_set_warm_boot_addr  720   --720
  qcom_scm_set_cold_boot_addr  728   --728
  scm_legacy_call 1492   -   -1492
  Total: Before=66737606, After=66733714, chg -0.01%

Commit 9a434cee773a ("firmware: qcom_scm: Dynamically support SMCCC and
legacy conventions") didn't mention any motivating factors for keeping
the legacy code around on arm64 kernels, i.e. presumably that commit
wasn't trying to support these legacy APIs on arm64 kernels.

Cc: Elliot Berman 
Cc: Brian Masney 
Cc: Stephan Gerhold 
Cc: Jeffrey Hugo 
Cc: Douglas Anderson 
Signed-off-by: Stephen Boyd 
---
It might be a good idea to wrap these lines from qcom_scm_call with #if 
IS_ENABLED(CONFIG_ARM), and the corresponding ones in qcom_scm_call_atomic:


  case SMC_CONVENTION_LEGACY:
  return scm_legacy_call(dev, desc, res);

If something is wrong with loaded firmware and LEGACY convention is 
incorrectly selected, you would get a better hint about the problem: 
"Unknown current SCM calling convention." You would still get the hint 
earlier from __get_convention, but that may not be obvious to someone 
unfamiliar with the SCM driver.


I'll defer to your/Bjorn's preference:

Acked-by: Elliot Berman 

with or without modifying qcom_scm_call{_atomic}.



Followup to v1 
(https://lore.kernel.org/r/20210223214539.1336155-7-swb...@chromium.org):
  * Don't change the legacy file to use legacy calls only
  * Wrap more things in CONFIG_ARM checks

  drivers/firmware/Makefile   |  4 +++-
  drivers/firmware/qcom_scm.c | 47 -
  drivers/firmware/qcom_scm.h | 15 
  include/linux/qcom_scm.h| 21 ++---
  4 files changed, 56 insertions(+), 31 deletions(-)

diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index 5e013b6a3692..0b7b3a6c 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -17,7 +17,9 @@ obj-$(CONFIG_ISCSI_IBFT)  += iscsi_ibft.o
  obj-$(CONFIG_FIRMWARE_MEMMAP) += memmap.o
  obj-$(CONFIG_RASPBERRYPI_FIRMWARE) += raspberrypi.o
  obj-$(CONFIG_FW_CFG_SYSFS)+= qemu_fw_cfg.o
-obj-$(CONFIG_QCOM_SCM) += qcom_scm.o qcom_scm-smc.o qcom_scm-legacy.o
+obj-$(CONFIG_QCOM_SCM) += qcom_scm_objs.o
+qcom_scm_objs-$(CONFIG_ARM)+= qcom_scm-legacy.o
+qcom_scm_objs-$(CONFIG_QCOM_SCM) += qcom_scm.o qcom_scm-smc.o
  obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o
  obj-$(CONFIG_TRUSTED_FOUNDATIONS) += trusted_foundations.o
  obj-$(CONFIG_TURRIS_MOX_RWTM) += turris-mox-rwtm.o
diff --git a/drivers/firmware/qcom_scm.c b/drivers/firmware/qcom_scm.c
index ee9cb545e73b..747808a8ddf4 100644
--- a/drivers/firmware/qcom_scm.c
+++ b/drivers/firmware/qcom_scm.c
@@ -49,28 +49,6 @@ struct qcom_scm_mem_map_info {
__le64 mem_size;
  };
  
-#define QCOM_SCM_FLAG_COLDBOOT_CPU0	0x00
-#define QCOM_SCM_FLAG_COLDBOOT_CPU1	0x01
-#define QCOM_SCM_FLAG_COLDBOOT_CPU2	0x08
-#define QCOM_SCM_FLAG_COLDBOOT_CPU3	0x20
-
-#define QCOM_SCM_FLAG_WARMBOOT_CPU0	0x04
-#define QCOM_SCM_FLAG_WARMBOOT_CPU1	0x02
-#define QCOM_SCM_FLAG_WARMBOOT_CPU2	0x10
-#define QCOM_SCM_FLAG_WARMBOOT_CPU3	0x40
-
-struct qcom_scm_wb_entry {
-   int flag;
-   void *entry;
-};
-
-static struct qcom_scm_wb_entry qcom_scm_wb[] = {
-   

Re: [PATCH] ext4: Fix ext4_error_err save negative errno into superblock

2021-04-01 Thread Andreas Dilger
On Apr 1, 2021, at 1:22 AM, Ye Bin  wrote:
> 
> As read_mmp_block return 1 when failed, so just pass retval to
> save_error_info.

Thank you for submitting this patch, but it should not be accepted.

The commit message is confusing, since the code being changed relates
to retval from write_mmp_block().  That currently returns 1, but
only until your next patch is applied.

I think it is better to fix write_mmp_block() as in your next patch
to return a negative value to be more consistent with other code.

Cheers, Andreas

> Fixes: 54d3adbc29f0 ("ext4: save all error info in save_error_info() and
> drop ext4_set_errno()")
> Reported-by: Liu Zhi Qiang 
> Signed-off-by: Ye Bin 
> ---
> fs/ext4/mmp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 795c3ff2907c..bb8353e25841 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -171,7 +171,7 @@ static int kmmpd(void *data)
>*/
>   if (retval) {
>   if ((failed_writes % 60) == 0) {
> - ext4_error_err(sb, -retval,
> + ext4_error_err(sb, retval,
>  "Error writing to MMP block");
>   }
>   failed_writes++;
> --
> 2.25.4
> 




Re: [PATCH 2/2] net: mdio: support c45 peripherals on c22 busses

2021-04-01 Thread Danilo Krummrich
On Thu, Apr 01, 2021 at 09:48:58AM +0100, Russell King - ARM Linux admin wrote:
> On Thu, Apr 01, 2021 at 03:23:05AM +0200, danilokrummr...@dk-develop.de wrote:
> > On 2021-03-31 20:35, Russell King - ARM Linux admin wrote:
> > > On Wed, Mar 31, 2021 at 07:58:33PM +0200, danilokrummr...@dk-develop.de
> > > wrote:
> > > > For this cited change the only thing happening is that if
> > > > get_phy_device()
> > > > already failed for probing with is_c45==false (C22 devices) it tries
> > > > to
> > > > probe with is_c45==true (C45 devices) which then either results into
> > > > actual
> > > > C45 frame transfers or indirect accesses by calling mdiobus_c45_*()
> > > > functions.
> > >
> > > Please explain why and how a PHY may not appear to be present using
> > > C22 frames to read the ID registers, but does appear to be present
> > > when using C22 frames to the C45 indirect registers - and summarise
> > > which PHYs have this behaviour.
> > >
> > > It seems very odd that any PHY would only implement C45 indirect
> > > registers in the C22 register space.
> >
> > Honestly, I can't list examples of that case (at least none that have an
> > upstream driver already). This part of my patch, to fall back to c45 bus
> > probing when c22 probing does not succeed, is also motivated by the fact
> > that this behaviour was already introduced with this patch:
>
> So, if I understand what you've just said, you want to make a change to
> the kernel to add support for something that you don't need and don't
> know that there's any hardware that needs it.  Is that correct?
>
No, not at all. As I explained, I based this part of the patch in mdiobus_scan()
on Jeremy's patch only. It was an indicator to me that there might
be c45 PHYs that don't respond to c22 requests. I interpreted his commit
message as saying that those c45 PHYs are capable of processing c22 requests in
general but implement the indirect registers only, since he said "its possible
that a c45 device doesn't respond despite being a standard phy".

You said that this behaviour would be very odd and I agree. Now, likely I just
misinterpreted this, and Jeremy actually means that the PHYs he's referring to
don't support c22 access at all. In this case we can for sure just forget
about the changes in mdiobus_scan() of this patch. I'll remove them.

Again, just to sort this out, this part of the patch is not its main purpose.
However, since I implemented the fallback to indirect accesses anyway, I
thought it doesn't hurt to consider the case that a PHY implements only the
indirect access registers. Anyway, I admit that this is likely pointless and, as
said, I'll remove it from the patch.
> > commit 0cc8fecf041d3e5285380da62cc6662bdc942d8c
> > Author: Jeremy Linton 
> > Date:   Mon Jun 22 20:35:32 2020 +0530
> >
> > net: phy: Allow mdio buses to auto-probe c45 devices
> >
> > The mdiobus_scan logic is currently hardcoded to only
> > work with c22 devices. This works fairly well in most
> > cases, but its possible that a c45 device doesn't respond
> > despite being a standard phy. If the parent hardware
> > is capable, it makes sense to scan for c22 devices before
> > falling back to c45.
> >
> > As we want this to reflect the capabilities of the STA,
> > lets add a field to the mii_bus structure to represent
> > the capability. That way devices can opt into the extended
> > scanning. Existing users should continue to default to c22
> > only scanning as long as they are zero'ing the structure
> > before use.
> >
> > Signed-off-by: Jeremy Linton 
> > Signed-off-by: Calvin Johnson 
> > Signed-off-by: David S. Miller 
> >
> > In this patch i.a. the following lines were added.
> >
> > +   case MDIOBUS_C22_C45:
> > +   phydev = get_phy_device(bus, addr, false);
> > +   if (IS_ERR(phydev))
> > +   phydev = get_phy_device(bus, addr, true);
> > +   break;
> >
> > I'm applying the same logic for MDIOBUS_NO_CAP and MDIOBUS_C22, since
> > with my patch MDIO controllers with those capabilities can handle c45 bus
> > probing with indirect accesses.
>
> If the PHY doesn't respond to C22 accesses but is a C45 PHY, then how
> can this work (you seem to have essentially said it doesn't above.)
>
As stated above, I likely misinterpreted his commit message as saying that only
the indirect registers are implemented.
> > [By the way, I'm unsure if this order for MDIO bus controllers with the
> > capability MDIOBUS_C22_C45 makes sense, because if we assume that the
> > majority of c45 PHYs responds well to c22 probing (which I'm convinced of)
>
> There are some which don't - Clause 45 allows PHYs not to implement
> support for Clause 22 accesses.
>
Yes, I agree.
> > the PHY would still be registered as is_c45==false, which means that even
> > though the underlying bus is capable of real c45 framing, only indirect
> > accessing would be performed. But 

Re: [PATCH V2 3/5] DCC: Added the sysfs entries for DCC(Data Capture and Compare) driver

2021-04-01 Thread Stephen Boyd
Quoting schow...@codeaurora.org (2021-04-01 08:42:50)
> On 2021-03-30 01:39, Stephen Boyd wrote:
> > Quoting Souradeep Chowdhury (2021-03-25 01:02:34)
> >> The DCC is a DMA engine designed to store register values either in
> >> case of a system crash or in case of software triggers manually done
> >> by the user. Using DCC hardware and the sysfs interface of the driver,
> >> the user can exploit various functionalities of DCC. The user can specify
> >> the register addresses, the values of which are stored by DCC in its
> >> dedicated SRAM. The register addresses can be used either to read from,
> >> write to, first read and store the value and then write, or to loop. All
> >> these options can be exploited using the sysfs interface given to the user.
> >> Following are the sysfs interfaces exposed in the DCC driver, which are
> >> documented:
> >> 1)trigger
> >> 2)config
> >> 3)config_write
> >> 4)config_reset
> >> 5)enable
> >> 6)rd_mod_wr
> >> 7)loop
> >> 
> >> Signed-off-by: Souradeep Chowdhury 
> >> ---
> >>  Documentation/ABI/testing/sysfs-driver-dcc | 114 
> >> +
> > 
> > Please combine this with the driver patch.
> 
> Ack
> 
> > 
> >>  1 file changed, 114 insertions(+)
> >>  create mode 100644 Documentation/ABI/testing/sysfs-driver-dcc
> > 
> > Perhaps this should be an ioctl interface instead of a sysfs interface?
> 
> The reasons for choosing sysfs over ioctl are as follows

Cool, please add these details to the commit text.

> 
> 
> i) As can be seen from the sysfs attribute descriptions, most of it does 
> basic hardware manipulations like dcc_enable, dcc_disable, config reset 
> etc. As a result sysfs is preferred over ioctl as we just need to enter 
> a 0 or 1
> signal in such cases.
> 
> ii) There are existing similar debug hardware blocks whose drivers have
> been written using a sysfs interface. One such example is the
> coresight-etm-trace driver. Following is the link for reference
> 
> https://www.kernel.org/doc/html/latest/trace/coresight/coresight-etm4x-reference.html

I wasn't deeply involved but I recall that the whole coresight sysfs
interface was disliked and it mostly got rewritten to go through the
perf tool instead. Pointing to the coresight sysfs API is not the best
approach here.

Maybe a closer analog would be the watchdog subsystem, which is ioctl
based and uses a character device like /dev/watchdog. Watchdogs are
simple debug features that reboot the system when everything goes wrong.
This looks like a hardware block that can be used to gather information
when the watchdog fires.

Reading the doc closer it is quite frightening that a device like this
can let you read registers in the hardware on-demand and even store
values of registers over time. This is like /dev/mem on steroids. This
needs to be highly restricted as it sounds like it could be used to
snoop on security keys that are set in the hardware or secrets stored in
memory. Is the hardware restricted at all in what it can read?




Re: [GIT PULL] LTO fix for v5.12-rc6

2021-04-01 Thread pr-tracker-bot
The pull request you sent on Thu, 1 Apr 2021 14:39:19 -0700:

> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git 
> tags/lto-v5.12-rc6

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1678e493d530e7977cce34e59a86bb86f3c5631e

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Paul E. McKenney
On Thu, Apr 01, 2021 at 05:48:56PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 01:47:02AM +0200, Frederic Weisbecker wrote:
> > When an ssp has already started a grace period and queued an early work
> > to flush after SRCU workqueues are created, we expect the ssp to be
> > properly initialized already. So we can skip this step at this stage.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Boqun Feng 
> > Cc: Lai Jiangshan 
> > Cc: Neeraj Upadhyay 
> > Cc: Josh Triplett 
> > Cc: Joel Fernandes 
> > Cc: Uladzislau Rezki 
> > ---
> >  kernel/rcu/srcutree.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 036ff5499ad5..7197156418e4 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1396,7 +1396,6 @@ void __init srcu_init(void)
> > while (!list_empty(&srcu_boot_list)) {
> > ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> >   work.work.entry);
> > -   check_init_srcu_struct(ssp);
> 
> You lost me on this one.  What happens if the only pre-initialization
> invocation on the statically allocated srcu_struct pointed to by ssp
> was call_srcu()?  I am not seeing how the initialization has already
> happened in that case.
> 
> What am I missing here?

Idiot here was looking at Tiny SRCU's call_srcu(), not that of Tree SRCU.
Never mind, I will queue this one as well.

Thanx, Paul

> > list_del_init(&ssp->work.work.entry);
> > queue_work(rcu_gp_wq, &ssp->work.work);
> > }
> > -- 
> > 2.25.1
> > 


Re: [PATCH 2/3] srcu: Remove superfluous sdp->srcu_lock_count zero filling

2021-04-01 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 01:47:03AM +0200, Frederic Weisbecker wrote:
> alloc_percpu() zeroes out the allocated memory. Therefore we can assume
> the whole struct srcu_data to be clear after calling it, just like after
> a static initialization. No need for a special treatment in the dynamic
> allocation case.
> 
> Signed-off-by: Frederic Weisbecker 
> Cc: Boqun Feng 
> Cc: Lai Jiangshan 
> Cc: Neeraj Upadhyay 
> Cc: Josh Triplett 
> Cc: Joel Fernandes 
> Cc: Uladzislau Rezki 

Good catch, thank you!!!  I queued the following with the usual
wordsmithing, so as usual please let me know if I messed anything up.

Thanx, Paul



commit 2cfdfbfc41bcd96e7961a619e4a7f235b274f78f
Author: Frederic Weisbecker 
Date:   Fri Apr 2 01:47:03 2021 +0200

srcu: Remove superfluous sdp->srcu_lock_count zero filling

Because alloc_percpu() zeroes out the allocated memory, there is no need
to zero-fill newly allocated per-CPU memory.  This commit therefore removes
the loop zeroing the ->srcu_lock_count and ->srcu_unlock_count arrays
from init_srcu_struct_nodes().  This is the only use of that function's
is_static parameter, which this commit also removes.

Signed-off-by: Frederic Weisbecker 
Cc: Boqun Feng 
Cc: Lai Jiangshan 
Cc: Neeraj Upadhyay 
Cc: Josh Triplett 
Cc: Joel Fernandes 
Cc: Uladzislau Rezki 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 036ff54..7389e46 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -80,7 +80,7 @@ do { \
  * srcu_read_unlock() running against them.  So if the is_static parameter
  * is set, don't initialize ->srcu_lock_count[] and ->srcu_unlock_count[].
  */
-static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static)
+static void init_srcu_struct_nodes(struct srcu_struct *ssp)
 {
int cpu;
int i;
@@ -148,14 +148,6 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp, bool is_static)
timer_setup(>delay_work, srcu_delay_timer, 0);
sdp->ssp = ssp;
sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
-   if (is_static)
-   continue;
-
-   /* Dynamically allocated, better be no srcu_read_locks()! */
-   for (i = 0; i < ARRAY_SIZE(sdp->srcu_lock_count); i++) {
-   sdp->srcu_lock_count[i] = 0;
-   sdp->srcu_unlock_count[i] = 0;
-   }
}
 }
 
@@ -179,7 +171,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
ssp->sda = alloc_percpu(struct srcu_data);
if (!ssp->sda)
return -ENOMEM;
-   init_srcu_struct_nodes(ssp, is_static);
+   init_srcu_struct_nodes(ssp);
ssp->srcu_gp_seq_needed_exp = 0;
ssp->srcu_last_gp_end = ktime_get_mono_fast_ns();
smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */


Re: [PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Frederic Weisbecker
On Thu, Apr 01, 2021 at 05:48:56PM -0700, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 01:47:02AM +0200, Frederic Weisbecker wrote:
> > When an ssp has already started a grace period and queued an early work
> > to flush after SRCU workqueues are created, we expect the ssp to be
> > properly initialized already. So we can skip this step at this stage.
> > 
> > Signed-off-by: Frederic Weisbecker 
> > Cc: Boqun Feng 
> > Cc: Lai Jiangshan 
> > Cc: Neeraj Upadhyay 
> > Cc: Josh Triplett 
> > Cc: Joel Fernandes 
> > Cc: Uladzislau Rezki 
> > ---
> >  kernel/rcu/srcutree.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 036ff5499ad5..7197156418e4 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1396,7 +1396,6 @@ void __init srcu_init(void)
> > while (!list_empty(&srcu_boot_list)) {
> > ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> >   work.work.entry);
> > -   check_init_srcu_struct(ssp);
> 
> You lost me on this one.  What happens if the only pre-initialization
> invocation on the statically allocated srcu_struct pointed to by ssp
> was call_srcu()?  I am not seeing how the initialization has already
> happened in that case.
> 
> What am I missing here?

call_srcu() -> __call_srcu() -> srcu_gp_start_if_needed() ->
check_init_srcu_struct() ?

Or is it me missing something?


[PATCH v2 10/10] KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible

2021-04-01 Thread Sean Christopherson
Let the TDP MMU yield when unmapping a range in response to an MMU
notification, if yielding is allowed by said notification.  There is no
reason to disallow yielding in this case, and in theory the range being
invalidated could be quite large.

Cc: Ben Gardon 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7797d24f0937..dd17d9673ff2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 
for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
flush |= zap_gfn_range(kvm, root, range->start, range->end,
-  false, flush);
+  range->may_block, flush);
 
return flush;
 }
@@ -903,6 +903,10 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 
rcu_read_lock();
 
+   /*
+* Don't support rescheduling, none of the MMU notifiers that funnel
+* into this helper allow blocking; it'd be dead, wasteful code.
+*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-01 Thread Sean Christopherson
Avoid taking mmu_lock for unrelated .invalidate_range_{start,end}()
notifications.  Because mmu_notifier_count must be modified while holding
mmu_lock for write, and must always be paired across start->end to stay
balanced, lock elision must happen in both or none.  To meet that
requirement, add a rwsem to prevent memslot updates across range_start()
and range_end().

Use a rwsem instead of a rwlock since most notifiers _allow_ blocking,
and the lock will be held across the entire start() ... end() sequence.
If anything in the sequence sleeps, including the caller or a different
notifier, holding the spinlock would be disastrous.

For notifiers that _disallow_ blocking, e.g. OOM reaping, simply go down
the slow path of unconditionally acquiring mmu_lock.  The sane
alternative would be to try to acquire the lock and force the notifier
to retry on failure.  But since OOM is currently the _only_ scenario
where blocking is disallowed, attempting to optimize a guest that has been
marked for death is pointless.

Unconditionally define and use mmu_notifier_slots_lock in the memslots
code, purely to avoid more #ifdefs.  The overhead of acquiring the lock
is negligible when the lock is uncontested, which will always be the case
when the MMU notifiers are not used.

Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.

Based heavily on code from Ben Gardon.

Suggested-by: Ben Gardon 
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h |  6 ++-
 virt/kvm/kvm_main.c  | 96 +++-
 2 files changed, 80 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 40ac2d40bb5a..bc3dd2838bb8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -472,6 +472,7 @@ struct kvm {
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
struct mutex slots_lock;
+   struct rw_semaphore mmu_notifier_slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
@@ -660,8 +661,9 @@ static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
 {
as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
-   lockdep_is_held(&kvm->slots_lock) ||
-   !refcount_read(&kvm->users_count));
+ lockdep_is_held(&kvm->slots_lock) ||
+ lockdep_is_held(&kvm->mmu_notifier_slots_lock) ||
+ !refcount_read(&kvm->users_count));
 }
 
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f6697ad741ed..af28f39817a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -462,6 +462,7 @@ struct kvm_hva_range {
pte_t pte;
hva_handler_t handler;
on_lock_fn_t on_lock;
+   bool must_lock;
bool flush_on_ret;
bool may_block;
 };
@@ -479,6 +480,25 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+
+/* Acquire mmu_lock if necessary.  Returns %true if @handler is "null" */
+static __always_inline bool kvm_mmu_lock_and_check_handler(struct kvm *kvm,
+  const struct kvm_hva_range *range,
+  bool *locked)
+{
+   if (*locked)
+   return false;
+
+   *locked = true;
+
+   KVM_MMU_LOCK(kvm);
+
+   if (!IS_KVM_NULL_FN(range->on_lock))
+   range->on_lock(kvm, range->start, range->end);
+
+   return IS_KVM_NULL_FN(range->handler);
+}
+
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  const struct kvm_hva_range *range)
 {
@@ -495,16 +515,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 
idx = srcu_read_lock(&kvm->srcu);
 
-   /* The on_lock() path does not yet support lock elision. */
-   if (!IS_KVM_NULL_FN(range->on_lock)) {
-   locked = true;
-   KVM_MMU_LOCK(kvm);
-
-   range->on_lock(kvm, range->start, range->end);
-
-   if (IS_KVM_NULL_FN(range->handler))
-   goto out_unlock;
-   }
+   if (range->must_lock &&
+   kvm_mmu_lock_and_check_handler(kvm, range, &locked))
+   goto out_unlock;
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
@@ -534,10 +547,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 

[PATCH v2 08/10] KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot

2021-04-01 Thread Sean Christopherson
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.

For small VMs, spurious locking is a minor annoyance.  And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.

But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space.  In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults.  In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.

x86's TDP MMU implementation is especially susceptible to spurious
locking due to it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), .invalidate_range_end(), and
.change_pte().  Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.

Note, this optimizes only the unpaired callbacks.  Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.

Suggested-by: Ben Gardon 
Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ecb5235e17..f6697ad741ed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -482,10 +482,10 @@ static void kvm_null_fn(void)
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  const struct kvm_hva_range *range)
 {
+   bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
-   bool ret = false;
int i, idx;
 
/* A null handler is allowed if and only if on_lock() is provided. */
@@ -493,11 +493,13 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 IS_KVM_NULL_FN(range->handler)))
return 0;
 
-   KVM_MMU_LOCK(kvm);
-
idx = srcu_read_lock(&kvm->srcu);
 
+   /* The on_lock() path does not yet support lock elision. */
if (!IS_KVM_NULL_FN(range->on_lock)) {
+   locked = true;
+   KVM_MMU_LOCK(kvm);
+
range->on_lock(kvm, range->start, range->end);
 
if (IS_KVM_NULL_FN(range->handler))
@@ -532,6 +534,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
 
+   if (!locked) {
+   locked = true;
+   KVM_MMU_LOCK(kvm);
+   }
ret |= range->handler(kvm, &gfn_range);
}
}
@@ -540,7 +546,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
kvm_flush_remote_tlbs(kvm);
 
 out_unlock:
-   KVM_MMU_UNLOCK(kvm);
+   if (locked)
+   KVM_MMU_UNLOCK(kvm);
 
srcu_read_unlock(&kvm->srcu, idx);
 
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v2 07/10] KVM: Move MMU notifier's mmu_lock acquisition into common helper

2021-04-01 Thread Sean Christopherson
Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
instead of requiring the caller to do the same.  This paves the way for
future patches to take mmu_lock if and only if an overlapping memslot is
found, without also having to introduce the on_lock() shenanigans used
to manipulate the notifier count and sequence.

No functional change intended.

Signed-off-by: Sean Christopherson 
---

Note, the WARN_ON_ONCE that asserts on_lock and handler aren't both null
is optimized out of all functions on recent gcc (for x86).  I wanted to
make it a BUILD_BUG_ON, but older versions of gcc aren't aggressive/smart
enough to optimize it out, and using __builtin_constant_p() to get it to
build on older compilers prevents the assertion from firing on newer
compilers when given bad input.

I'm also a-ok dropping the check altogether, it just felt wrong having
the semi-funky on_lock -> !handler combo without documenting that handler
isn't allowed to be null in the common case.

 virt/kvm/kvm_main.c | 125 +---
 1 file changed, 82 insertions(+), 43 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e809d73c7f1..25ecb5235e17 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -453,28 +453,57 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
+typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
+unsigned long end);
+
 struct kvm_hva_range {
unsigned long start;
unsigned long end;
pte_t pte;
hva_handler_t handler;
+   on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
 };
 
+/*
+ * Use a dedicated stub instead of NULL to indicate that there is no callback
+ * function/handler.  The compiler technically can't guarantee that a real
+ * function will have a non-zero address, and so it will generate code to
+ * check for !NULL, whereas comparing against a stub will be elided at compile
+ * time (unless the compiler is getting long in the tooth, e.g. gcc 4.9).
+ */
+static void kvm_null_fn(void)
+{
+
+}
+#define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
+
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
  const struct kvm_hva_range *range)
 {
-   struct kvm_memory_slot *slot;
-   struct kvm_memslots *slots;
struct kvm_gfn_range gfn_range;
+   struct kvm_memory_slot *slot;
+   struct kvm_memslots *slots;
bool ret = false;
int i, idx;
 
-   lockdep_assert_held_write(&kvm->mmu_lock);
+   /* A null handler is allowed if and only if on_lock() is provided. */
+   if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
+IS_KVM_NULL_FN(range->handler)))
+   return 0;
+
+   KVM_MMU_LOCK(kvm);
 
idx = srcu_read_lock(&kvm->srcu);
 
+   if (!IS_KVM_NULL_FN(range->on_lock)) {
+   range->on_lock(kvm, range->start, range->end);
+
+   if (IS_KVM_NULL_FN(range->handler))
+   goto out_unlock;
+   }
+
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, slots) {
@@ -510,6 +539,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && (ret || kvm->tlbs_dirty))
kvm_flush_remote_tlbs(kvm);
 
+out_unlock:
+   KVM_MMU_UNLOCK(kvm);
+
srcu_read_unlock(&kvm->srcu, idx);
 
/* The notifiers are averse to booleans. :-( */
@@ -528,16 +560,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.end= end,
.pte= pte,
.handler= handler,
+   .on_lock= (void *)kvm_null_fn,
.flush_on_ret   = true,
.may_block  = false,
};
-   int ret;
 
-   KVM_MMU_LOCK(kvm);
-   ret = __kvm_handle_hva_range(kvm, &range);
-   KVM_MMU_UNLOCK(kvm);
-
-   return ret;
+   return __kvm_handle_hva_range(kvm, &range);
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
@@ -551,16 +579,12 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.end= end,
.pte= __pte(0),
.handler= handler,
+   .on_lock= (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = false,
};
-   int ret;
 
-   KVM_MMU_LOCK(kvm);
-   ret = __kvm_handle_hva_range(kvm, &range);
-   KVM_MMU_UNLOCK(kvm);
-
-   return ret;
+   return __kvm_handle_hva_range(kvm, &range);
 }
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
  

[PATCH v2 06/10] KVM: Kill off the old hva-based MMU notifier callbacks

2021-04-01 Thread Sean Christopherson
Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h   |  1 -
 arch/mips/include/asm/kvm_host.h|  1 -
 arch/powerpc/include/asm/kvm_host.h |  1 -
 arch/x86/include/asm/kvm_host.h |  1 -
 include/linux/kvm_host.h|  8 ---
 virt/kvm/kvm_main.c | 85 -
 6 files changed, 97 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1ad729cf7b0d..72e6b4600264 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,7 +582,6 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 374a3c8806e8..feaa77036b67 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,7 +967,6 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1335f0001bdd..1e83359f286b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,7 +55,6 @@
 #include 
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM(1 << 15)
 #define HPTEG_HASH_BITS_PTE13
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a21e3698f4dc..99778ac51243 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,7 +1718,6 @@ asmlinkage void kvm_spurious_fault(void);
_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
-#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6bb401dd856..40ac2d40bb5a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,7 +219,6 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -231,13 +230,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-#else
-int kvm_unmap_hva_range(struct kvm *kvm,
-   unsigned long start, unsigned long end, unsigned flags);
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7a7e62ae5eb4..2e809d73c7f1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,8 +451,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
 }
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 struct kvm_hva_range {
@@ -564,8 +562,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 
return ret;
 }
-#endif /* KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS */
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -573,9 +569,6 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-#ifndef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
-   int idx;
-#endif
trace_kvm_set_spte_hva(address);
 
/*
@@ -585,26 +578,13 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 */
WARN_ON_ONCE(!kvm->mmu_notifier_count);
 
-#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
-#else
-   idx = srcu_read_lock(&kvm->srcu);
-
-   KVM_MMU_LOCK(kvm);
-

[PATCH v2 05/10] KVM: PPC: Convert to the gfn-based MMU notifier callbacks

2021-04-01 Thread Sean Christopherson
Move PPC to the gfn-based MMU notifier APIs, and update all 15 bajillion
PPC-internal hooks to work with gfns instead of hvas.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson 
---
 arch/powerpc/include/asm/kvm_book3s.h  | 12 ++--
 arch/powerpc/include/asm/kvm_host.h|  1 +
 arch/powerpc/include/asm/kvm_ppc.h |  9 ++-
 arch/powerpc/kvm/book3s.c  | 18 +++--
 arch/powerpc/kvm/book3s.h  | 10 ++-
 arch/powerpc/kvm/book3s_64_mmu_hv.c| 98 +++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 25 +++
 arch/powerpc/kvm/book3s_hv.c   | 12 ++--
 arch/powerpc/kvm/book3s_pr.c   | 56 +--
 arch/powerpc/kvm/e500_mmu_host.c   | 27 +++
 10 files changed, 95 insertions(+), 173 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 2f5f919f6cd3..2d03f2930767 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -210,12 +210,12 @@ extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
  unsigned int lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
-extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-   unsigned long gfn);
-extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-   unsigned long gfn);
-extern int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
-   unsigned long gfn);
+extern bool kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+   unsigned long gfn);
+extern bool kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+ unsigned long gfn);
+extern bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
+  unsigned long gfn);
 extern long kvmppc_hv_get_dirty_log_radix(struct kvm *kvm,
struct kvm_memory_slot *memslot, unsigned long *map);
 extern void kvmppc_radix_flush_memslot(struct kvm *kvm,
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1e83359f286b..1335f0001bdd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -55,6 +55,7 @@
 #include 
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 #define HPTEG_CACHE_NUM(1 << 15)
 #define HPTEG_HASH_BITS_PTE13
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 8aacd76bb702..21ab0332eb42 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -281,11 +281,10 @@ struct kvmppc_ops {
 const struct kvm_memory_slot *old,
 const struct kvm_memory_slot *new,
 enum kvm_mr_change change);
-   int (*unmap_hva_range)(struct kvm *kvm, unsigned long start,
-  unsigned long end);
-   int (*age_hva)(struct kvm *kvm, unsigned long start, unsigned long end);
-   int (*test_age_hva)(struct kvm *kvm, unsigned long hva);
-   void (*set_spte_hva)(struct kvm *kvm, unsigned long hva, pte_t pte);
+   bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+   bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
void (*destroy_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 44bf567b6589..2b691f4d1f26 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -834,26 +834,24 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
 }
 
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-   unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-   return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
+   return kvm->arch.kvm_ops->unmap_gfn_range(kvm, range);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-   return kvm->arch.kvm_ops->age_hva(kvm, start, end);
+   return kvm->arch.kvm_ops->age_gfn(kvm, range);
 }
 
-int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+bool kvm_test_age_gfn(struct 

[PATCH v2 04/10] KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks

2021-04-01 Thread Sean Christopherson
Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code, and whose code is nearly identical to MIPS'
lookup.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson 
---
 arch/mips/include/asm/kvm_host.h |  1 +
 arch/mips/kvm/mmu.c  | 97 ++--
 2 files changed, 17 insertions(+), 81 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index feaa77036b67..374a3c8806e8 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -967,6 +967,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
   bool write);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 /* Emulation */
 int kvm_get_inst(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 3dabeda82458..3dc885df2e32 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -439,85 +439,36 @@ static int kvm_mips_mkold_gpa_pt(struct kvm *kvm, gfn_t start_gfn,
  end_gfn << PAGE_SHIFT);
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-unsigned long start,
-unsigned long end,
-int (*handler)(struct kvm *kvm, gfn_t gfn,
-   gpa_t gfn_end,
-   struct kvm_memory_slot *memslot,
-   void *data),
-void *data)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-   struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
-   int ret = 0;
-
-   slots = kvm_memslots(kvm);
-
-   /* we only care about the pages that the guest sees */
-   kvm_for_each_memslot(memslot, slots) {
-   unsigned long hva_start, hva_end;
-   gfn_t gfn, gfn_end;
-
-   hva_start = max(start, memslot->userspace_addr);
-   hva_end = min(end, memslot->userspace_addr +
-   (memslot->npages << PAGE_SHIFT));
-   if (hva_start >= hva_end)
-   continue;
-
-   /*
-* {gfn(page) | page intersects with [hva_start, hva_end)} =
-* {gfn_start, gfn_start+1, ..., gfn_end-1}.
-*/
-   gfn = hva_to_gfn_memslot(hva_start, memslot);
-   gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
-
-   ret |= handler(kvm, gfn, gfn_end, memslot, data);
-   }
-
-   return ret;
-}
-
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-struct kvm_memory_slot *memslot, void *data)
-{
-   kvm_mips_flush_gpa_pt(kvm, gfn, gfn_end);
-   return 1;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
-   unsigned flags)
-{
-   handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
+   kvm_mips_flush_gpa_pt(kvm, range->start, range->end);
 
kvm_mips_callbacks->flush_shadow_all(kvm);
return 0;
 }
 
-static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
-   struct kvm_memory_slot *memslot, void *data)
+static bool __kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-   gpa_t gpa = gfn << PAGE_SHIFT;
-   pte_t hva_pte = *(pte_t *)data;
+   gpa_t gpa = range->start << PAGE_SHIFT;
+   pte_t hva_pte = range->pte;
pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
pte_t old_pte;
 
if (!gpa_pte)
-   return 0;
+   return false;
 
/* Mapping may need adjusting depending on memslot flags */
old_pte = *gpa_pte;
-   if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
+   if (range->slot->flags & KVM_MEM_LOG_DIRTY_PAGES && !pte_dirty(old_pte))
hva_pte = pte_mkclean(hva_pte);
-   else if (memslot->flags & KVM_MEM_READONLY)
+   else if (range->slot->flags & KVM_MEM_READONLY)
hva_pte = pte_wrprotect(hva_pte);
 
set_pte(gpa_pte, hva_pte);
 
/* Replacing an absent or old page doesn't need flushes */
if (!pte_present(old_pte) || !pte_young(old_pte))
-   return 0;
+   return false;
 
/* Pages swapped, aged, moved, or cleaned require flushes */
return !pte_present(hva_pte) ||
@@ -526,27 +477,21 @@ static int kvm_set_spte_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
   (pte_dirty(old_pte) && !pte_dirty(hva_pte));
 }
 
-int 

[PATCH v2 03/10] KVM: arm64: Convert to the gfn-based MMU notifier callbacks

2021-04-01 Thread Sean Christopherson
Move arm64 to the gfn-based MMU notifier APIs, which do the hva->gfn
lookup in common code.

No meaningful functional change intended, though the exact order of
operations is slightly different since the memslot lookups occur before
calling into arch code.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/kvm/mmu.c  | 117 --
 2 files changed, 33 insertions(+), 85 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 72e6b4600264..1ad729cf7b0d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -582,6 +582,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4b7e1e327337..35728231e9a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -839,7 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
 * the page we just got a reference to gets unmapped before we have a
 * chance to grab the mmu_lock, which ensure that if the page gets
-* unmapped afterwards, the call to kvm_unmap_hva will take it away
+* unmapped afterwards, the call to kvm_unmap_gfn will take it away
 * from us again properly. This smp_rmb() interacts with the smp_wmb()
 * in kvm_mmu_notifier_invalidate_<page|range_end>().
 */
@@ -1064,123 +1064,70 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
return ret;
 }
 
-static int handle_hva_to_gpa(struct kvm *kvm,
-unsigned long start,
-unsigned long end,
-int (*handler)(struct kvm *kvm,
-   gpa_t gpa, u64 size,
-   void *data),
-void *data)
-{
-   struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
-   int ret = 0;
-
-   slots = kvm_memslots(kvm);
-
-   /* we only care about the pages that the guest sees */
-   kvm_for_each_memslot(memslot, slots) {
-   unsigned long hva_start, hva_end;
-   gfn_t gpa;
-
-   hva_start = max(start, memslot->userspace_addr);
-   hva_end = min(end, memslot->userspace_addr +
-   (memslot->npages << PAGE_SHIFT));
-   if (hva_start >= hva_end)
-   continue;
-
-   gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
-   ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
-   }
-
-   return ret;
-}
-
-static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-   unsigned flags = *(unsigned *)data;
-   bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;
-
-   __unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
-   return 0;
-}
-
-int kvm_unmap_hva_range(struct kvm *kvm,
-   unsigned long start, unsigned long end, unsigned flags)
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
if (!kvm->arch.mmu.pgt)
return 0;
 
-   handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
-   return 0;
-}
+   __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
+(range->end - range->start) << PAGE_SHIFT,
+range->may_block);
 
-static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
-{
-   kvm_pfn_t *pfn = (kvm_pfn_t *)data;
-
-   WARN_ON(size != PAGE_SIZE);
-
-   /*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_hva()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, gpa, PAGE_SIZE,
-  __pfn_to_phys(*pfn), KVM_PGTABLE_PROT_R, NULL);
return 0;
 }
 
-int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-   unsigned long end = hva + PAGE_SIZE;
-   kvm_pfn_t pfn = pte_pfn(pte);
+   kvm_pfn_t pfn = pte_pfn(range->pte);
 
if (!kvm->arch.mmu.pgt)
return 0;
 
+   WARN_ON(range->end - range->start != 1);
+
/*
 * We've moved a page around, probably through CoW, so let's treat it
 * just like a translation fault and clean the cache to the 

[PATCH v2 02/10] KVM: Move x86's MMU notifier memslot walkers to generic code

2021-04-01 Thread Sean Christopherson
Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.

In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.

The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.

Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.

Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.

MIPS, PPC, and arm64 will be converted one at a time in future patches.
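
The hva->gfn lookup being consolidated can be sketched in userspace C.  The struct and helper below are illustrative stand-ins for the kernel's memslot machinery, not the actual KVM API: each arch walker clamped the notified hva range against a memslot and converted the overlap to a gfn range, which is exactly the part this patch lifts into common code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Simplified stand-in for the kernel's struct kvm_memory_slot. */
struct memslot {
	uint64_t userspace_addr;	/* host virtual base of the slot */
	uint64_t base_gfn;		/* guest frame number at slot start */
	uint64_t npages;
};

/* Clamp an hva range against one memslot and convert the overlap to a
 * gfn range.  Returns false if the ranges do not overlap. */
static bool hva_range_to_gfn_range(const struct memslot *slot,
				   uint64_t hva_start, uint64_t hva_end,
				   uint64_t *gfn_start, uint64_t *gfn_end)
{
	uint64_t slot_end = slot->userspace_addr + (slot->npages << PAGE_SHIFT);
	uint64_t start = hva_start > slot->userspace_addr ? hva_start
							  : slot->userspace_addr;
	uint64_t end = hva_end < slot_end ? hva_end : slot_end;

	if (start >= end)
		return false;

	*gfn_start = slot->base_gfn + ((start - slot->userspace_addr) >> PAGE_SHIFT);
	/* Round the exclusive end up so a partial page is still covered. */
	*gfn_end = slot->base_gfn +
		   ((end - slot->userspace_addr + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT);
	return true;
}
```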

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/mmu/mmu.c  | 127 +++--
 arch/x86/kvm/mmu/tdp_mmu.c  | 241 
 arch/x86/kvm/mmu/tdp_mmu.h  |  14 +-
 include/linux/kvm_host.h|  14 ++
 virt/kvm/kvm_main.c | 169 +-
 6 files changed, 317 insertions(+), 249 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 99778ac51243..a21e3698f4dc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1718,6 +1718,7 @@ asmlinkage void kvm_spurious_fault(void);
_ASM_EXTABLE(666b, 667b)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
+#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
 
 int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efb41f31e80a..f2046c41eb93 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1298,26 +1298,25 @@ static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
return flush;
 }
 
-static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-  struct kvm_memory_slot *slot, gfn_t gfn, int level,
-  unsigned long data)
+static bool kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+   struct kvm_memory_slot *slot, gfn_t gfn, int level,
+   pte_t unused)
 {
return kvm_zap_rmapp(kvm, rmap_head, slot);
 }
 
-static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-struct kvm_memory_slot *slot, gfn_t gfn, int level,
-unsigned long data)
+static bool kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ struct kvm_memory_slot *slot, gfn_t gfn, int level,
+ pte_t pte)
 {
u64 *sptep;
struct rmap_iterator iter;
int need_flush = 0;
u64 new_spte;
-   pte_t *ptep = (pte_t *)data;
kvm_pfn_t new_pfn;
 
-   WARN_ON(pte_huge(*ptep));
-   new_pfn = pte_pfn(*ptep);
+   WARN_ON(pte_huge(pte));
+   new_pfn = pte_pfn(pte);
 
 restart:
	for_each_rmap_spte(rmap_head, &iter, sptep) {
@@ -1326,7 +1325,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 
need_flush = 1;
 
-   if (pte_write(*ptep)) {
+   if (pte_write(pte)) {
pte_list_remove(rmap_head, sptep);
goto restart;
} else {
@@ -1414,86 +1413,52 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 slot_rmap_walk_okay(_iter_);   \
 slot_rmap_walk_next(_iter_))
 
-typedef int (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
- struct kvm_memory_slot *slot, gfn_t gfn,
- int level, unsigned long data);
+typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head 
*rmap_head,
+  struct kvm_memory_slot *slot, gfn_t gfn,
+  int level, pte_t pte);
 
-static __always_inline int kvm_handle_hva_range(struct kvm *kvm,
-   unsigned long start,
-   unsigned long end,
-   unsigned long data,
-   rmap_handler_t handler)
+static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
+struct kvm_gfn_range *range,
+rmap_handler_t handler)
 {
-   struct kvm_memslots *slots;
-   struct kvm_memory_slot *memslot;
struct slot_rmap_walk_iterator iterator;
-   int ret = 0;

[PATCH v2 01/10] KVM: Assert that notifier count is elevated in .change_pte()

2021-04-01 Thread Sean Christopherson
In KVM's .change_pte() notification callback, replace the notifier
sequence bump with a WARN_ON assertion that the notifier count is
elevated.  An elevated count provides stricter protections than bumping
the sequence, and the sequence is guaranteed to be bumped before the
count hits zero.

When .change_pte() was added by commit 828502d30073 ("ksm: add
mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
as .change_pte() would be invoked without any surrounding notifications.

However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
with invalidate_range_start and invalidate_range_end"), all calls to
.change_pte() are guaranteed to be bookended by start() and end(), and
so are guaranteed to run with an elevated notifier count.

Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
the purpose of .change_pte().  Every arch's kvm_set_spte_hva() assumes
.change_pte() is called when the relevant SPTE is present in KVM's MMU,
as the original goal was to accelerate Kernel Samepage Merging (KSM) by
updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
the SPTE).  I.e. it means that .change_pte() is effectively dead code
on _all_ architectures.

x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
is guaranteed due to the prior invalidation.  PPC simply unmaps the SPTE,
which again should be a nop due to the invalidation.  arm64 is a bit
murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
called without a cache pointer, which means it will map an entry if and
only if an existing PTE was found.

For now, take advantage of the bug to simplify future consolidation of
KVM's MMU notifier code.  Doing so will not greatly complicate fixing
.change_pte(), assuming it's even worth fixing.  .change_pte() has been
broken for 8+ years and no one has complained.  Even if there are
KSM+KVM users that care deeply about its performance, the benefits of
avoiding VM-Exits via .change_pte() need to be reevaluated to justify
the added complexity and testing burden.  Ripping out .change_pte()
entirely would be a lot easier.
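
The invariant the patch asserts can be modeled in a few lines of userspace C.  This is a toy model of the notifier bookkeeping, not kernel code: because .change_pte() always runs between .invalidate_range_start() and .invalidate_range_end(), the count is already elevated and the sequence was already bumped, so bumping it again in change_pte is redundant.

```c
#include <assert.h>

/* Toy model of KVM's MMU notifier state; field names follow the kernel's
 * but everything else here is illustrative. */
struct kvm_model {
	int mmu_notifier_count;
	unsigned long mmu_notifier_seq;
};

static void invalidate_range_start(struct kvm_model *kvm)
{
	kvm->mmu_notifier_count++;
	kvm->mmu_notifier_seq++;	/* the sequence is bumped here already */
}

static void invalidate_range_end(struct kvm_model *kvm)
{
	kvm->mmu_notifier_count--;
}

static void change_pte(struct kvm_model *kvm)
{
	/* Models the patch's WARN_ON_ONCE(!kvm->mmu_notifier_count): if this
	 * fires, change_pte ran without the start/end bookends. */
	assert(kvm->mmu_notifier_count > 0);
	/* ... update the SPTE; no extra mmu_notifier_seq++ needed ... */
}
```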

Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1de843b7618..8df091950161 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -461,12 +461,17 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
trace_kvm_set_spte_hva(address);
 
+   /*
+* .change_pte() must be bookended by .invalidate_range_{start,end}(),
+* and so always runs with an elevated notifier count.  This obviates
+* the need to bump the sequence count.
+*/
+   WARN_ON_ONCE(!kvm->mmu_notifier_count);
+
	idx = srcu_read_lock(&kvm->srcu);
 
KVM_MMU_LOCK(kvm);
 
-   kvm->mmu_notifier_seq++;
-
if (kvm_set_spte_hva(kvm, address, pte))
kvm_flush_remote_tlbs(kvm);
 
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v2 00/10] KVM: Consolidate and optimize MMU notifiers

2021-04-01 Thread Sean Christopherson
The end goal of this series is to optimize the MMU notifiers to take
mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
sensitive to mmu_lock being taken for write at inopportune times, and
such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
page shenanigans.  The vast majority of notifications for these VMs will
be spurious (for KVM), and eliding mmu_lock for spurious notifications
avoids an otherwise unacceptable disruption to the guest.
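
The end goal described above can be sketched as a short userspace model: walk the memslots first, and acquire mmu_lock only when the notified range actually overlaps a slot.  Structure and names here are illustrative, not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a memslot's hva extent. */
struct slot {
	uint64_t hva_start, hva_end;
};

static int lock_taken;		/* counts (modeled) mmu_lock acquisitions */

static bool range_overlaps_memslots(const struct slot *slots, int n,
				    uint64_t start, uint64_t end)
{
	for (int i = 0; i < n; i++)
		if (start < slots[i].hva_end && end > slots[i].hva_start)
			return true;
	return false;
}

static void notifier_invalidate(const struct slot *slots, int n,
				uint64_t start, uint64_t end)
{
	/* Spurious for KVM: no memslot overlap, so elide mmu_lock entirely. */
	if (!range_overlaps_memslots(slots, n, start, end))
		return;

	lock_taken++;		/* take mmu_lock, unmap the overlap ... */
}
```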

To get there without potentially degrading performance, e.g. due to
multiple memslot lookups, especially on non-x86 where the use cases are
largely unknown (from my perspective), first consolidate the MMU notifier
logic by moving the hva->gfn lookups into common KVM.

Based on kvm/queue, commit 5f986f748438 ("KVM: x86: dump_vmcs should
include the autoload/autostore MSR lists").

Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
I give it even odds that I introduced an off-by-one bug somewhere.

v2:
 - Drop the patches that have already been pushed to kvm/queue.
 - Drop two selftest changes that had snuck in via "git commit -a".
 - Add a patch to assert that mmu_notifier_count is elevated when
   .change_pte() runs. [Paolo]
 - Split out moving KVM_MMU_(UN)LOCK() to __kvm_handle_hva_range() to a
   separate patch.  Opted not to squash it with the introduction of the
   common hva walkers (patch 02), as that prevented sharing code between
   the old and new APIs. [Paolo]
 - Tweak the comment in kvm_vm_destroy() above the smashing of the new
   slots lock. [Paolo]
 - Make mmu_notifier_slots_lock unconditional to avoid #ifdefs. [Paolo]

v1:
 - https://lkml.kernel.org/r/20210326021957.1424875-1-sea...@google.com

Sean Christopherson (10):
  KVM: Assert that notifier count is elevated in .change_pte()
  KVM: Move x86's MMU notifier memslot walkers to generic code
  KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  KVM: Kill off the old hva-based MMU notifier callbacks
  KVM: Move MMU notifier's mmu_lock acquisition into common helper
  KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
memslot
  KVM: Don't take mmu_lock for range invalidation unless necessary
  KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
possible

 arch/arm64/kvm/mmu.c   | 117 +++--
 arch/mips/kvm/mmu.c|  97 ++--
 arch/powerpc/include/asm/kvm_book3s.h  |  12 +-
 arch/powerpc/include/asm/kvm_ppc.h |   9 +-
 arch/powerpc/kvm/book3s.c  |  18 +-
 arch/powerpc/kvm/book3s.h  |  10 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  98 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  25 +-
 arch/powerpc/kvm/book3s_hv.c   |  12 +-
 arch/powerpc/kvm/book3s_pr.c   |  56 ++---
 arch/powerpc/kvm/e500_mmu_host.c   |  27 +-
 arch/x86/kvm/mmu/mmu.c | 127 --
 arch/x86/kvm/mmu/tdp_mmu.c | 245 +++
 arch/x86/kvm/mmu/tdp_mmu.h |  14 +-
 include/linux/kvm_host.h   |  22 +-
 virt/kvm/kvm_main.c| 325 +++--
 16 files changed, 552 insertions(+), 662 deletions(-)

-- 
2.31.0.208.g409f899ff0-goog



Re: [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-04-01 Thread Wei Xu
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
>
>
> From: Keith Busch 
>
> Reclaim anonymous pages if a migration path is available now that
> demotion provides a non-swap recourse for reclaiming anon pages.
>
> Note that this check is subtly different from the
> anon_should_be_aged() checks.  This mechanism checks whether a
> specific page in a specific context *can* actually be reclaimed, given
> current swap space and cgroup limits.
>
> anon_should_be_aged() is a much simpler and more preliminary check
> which just says whether there is a possibility of future reclaim.
>
> #Signed-off-by: Keith Busch 
> Cc: Keith Busch 
> Signed-off-by: Dave Hansen 
> Reviewed-by: Yang Shi 
> Cc: Wei Xu 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> --
>
> Changes from Dave 10/2020:
>  * remove 'total_swap_pages' modification
>
> Changes from Dave 06/2020:
>  * rename reclaim_anon_pages()->can_reclaim_anon_pages()
>
> Note: Keith's Intel SoB is commented out because he is no
> longer at Intel and his @intel.com mail will bounce.
> ---
>
>  b/mm/vmscan.c |   35 ---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
> --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2021-03-31 15:17:19.388000242 -0700
> +++ b/mm/vmscan.c   2021-03-31 15:17:19.407000242 -0700
> @@ -287,6 +287,34 @@ static bool writeback_throttling_sane(st
>  }
>  #endif
>
> +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> + int node_id)
> +{
> +   if (memcg == NULL) {
> +   /*
> +* For non-memcg reclaim, is there
> +* space in any swap device?
> +*/
> +   if (get_nr_swap_pages() > 0)
> +   return true;
> +   } else {
> +   /* Is the memcg below its swap limit? */
> +   if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +   return true;
> +   }
> +
> +   /*
> +* The page can not be swapped.
> +*
> +* Can it be reclaimed from this node via demotion?
> +*/
> +   if (next_demotion_node(node_id) >= 0)
> +   return true;

When neither swap space nor RECLAIM_MIGRATE is enabled, but
next_demotion_node() is configured, inactive pages cannot be swapped out
nor demoted.  However, this check can still cause these pages to be sent
to shrink_page_list() (e.g., when can_reclaim_anon_pages() is called by
get_scan_count()) and make the THP pages being unnecessarily split there.

One fix would be to guard this next_demotion_node() check with the
RECLAIM_MIGRATE node_reclaim_mode check.  This RECLAIM_MIGRATE
check needs to be applied to other calls to next_demotion_node() in
vmscan.c as well.
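
A minimal userspace sketch of the suggested guard, assuming illustrative values for RECLAIM_MIGRATE and stubbed-out helpers (none of this is the kernel's actual code): a demotion target only justifies anon reclaim when RECLAIM_MIGRATE is actually enabled in node_reclaim_mode.

```c
#include <stdbool.h>

#define RECLAIM_MIGRATE	(1 << 3)	/* assumed bit value, for illustration */

static long nr_swap_pages;
static unsigned int node_reclaim_mode;
static int demotion_target = -1;	/* -1 means no demotion node configured */

static int next_demotion_node(int node_id)
{
	(void)node_id;
	return demotion_target;
}

/* The suggested fix: gate the demotion-node check on RECLAIM_MIGRATE so
 * pages aren't sent to shrink_page_list() (and THPs split) when they can
 * be neither swapped nor demoted. */
static bool can_reclaim_anon_pages(int node_id)
{
	if (nr_swap_pages > 0)
		return true;
	if ((node_reclaim_mode & RECLAIM_MIGRATE) &&
	    next_demotion_node(node_id) >= 0)
		return true;
	return false;
}
```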

> +
> +   /* No way to reclaim anon pages */
> +   return false;
> +}
> +
>  /*
>   * This misses isolated pages which are not accounted for to save counters.
>   * As the data only determines if reclaim or compaction continues, it is
> @@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
>
> nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -   if (get_nr_swap_pages() > 0)
> +   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
> nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -2323,6 +2351,7 @@ enum scan_balance {
>  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>unsigned long *nr)
>  {
> +   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> int swappiness = mem_cgroup_swappiness(memcg);
> @@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
> enum lru_list lru;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> -   if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
> +   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {

Demotion of anon pages still depends on sc->may_swap.  Any thoughts on
decoupling
demotion from swapping more completely?

> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2708,7 +2737,7 @@ static inline bool should_continue_recla
>  */
> pages_for_compaction = compact_gap(sc->order);
> inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -   if (get_nr_swap_pages() > 0)
> +   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
> inactive_lru_pages += node_page_state(pgdat, 
> 

Re: [PATCH V2 2/5] soc: qcom: dcc: Add driver support for Data Capture and Compare unit(DCC)

2021-04-01 Thread Stephen Boyd
Quoting schow...@codeaurora.org (2021-04-01 07:04:07)
> On 2021-03-30 01:35, Stephen Boyd wrote:
> > Quoting Souradeep Chowdhury (2021-03-25 01:02:33)
> >> diff --git a/drivers/soc/qcom/dcc.c b/drivers/soc/qcom/dcc.c
> >> new file mode 100644
> >> index 000..a55d8ca7
> >> --- /dev/null
> >> +++ b/drivers/soc/qcom/dcc.c
> >> @@ -0,0 +1,1549 @@
[..]
> > 
> >> +   void __iomem*base;
> >> +   u32 reg_size;
> >> +   struct device   *dev;
> >> +   struct mutexmutex;
> > 
> > In particular what this mutex is protecting.
> 
> Ack. The mutex is used to protect the access as well as manipulation of 
> the main instance of dcc_drvdata structure
> initialized during probe time. This structure contains the useful driver 
> data information and is set using the call
> platform_set_drvdata(pdev, drvdata) which links this data to the 
> platform device and hence needs to be protected via
> mutex locks. The same convention is followed across other similar 
> drivers exposing userspace like the llcc driver.

The region that the mutex is protecting seems quite large. That's
probably because I don't understand the driver.

> > 
> >> +
> >> +   mutex_lock(&drvdata->mutex);
> >> +
> >> +   for (curr_list = 0; curr_list < drvdata->nr_link_list; 
> >> curr_list++) {
> >> +   if (!drvdata->enable[curr_list])
> >> +   continue;
> >> +   ll_cfg = dcc_readl(drvdata, DCC_LL_CFG(curr_list));
> >> +   tmp_ll_cfg = ll_cfg & ~BIT(9);
> >> +   dcc_writel(drvdata, tmp_ll_cfg, 
> >> DCC_LL_CFG(curr_list));
> >> +   dcc_writel(drvdata, 1, DCC_LL_SW_TRIGGER(curr_list));
> >> +   dcc_writel(drvdata, ll_cfg, DCC_LL_CFG(curr_list));
> >> +   }
> > 
> > Does the mutex need to be held while waiting for ready?
> 
> Yes, to maintain consistency because inside the dcc_ready function, 
> there is access to dcc_drvdata structure set
> on the platform device.

Is the drvdata going to be modified somewhere else?

> >> +
> >> +   dev_info(drvdata->dev, "All values written to enable.\n");
> > 
> > Debug print?
> 
> Ack
> 
> > 
> >> +   /* Make sure all config is written in sram */
> >> +   mb();
> > 
> > This won't work as intended.
> 
> This was called to prevent instruction reordering if the driver runs on 
> multiple
> CPU cores. As the hardware manipulation has to be done sequentially 
> before the
> trigger is set. Kindly let me know the concern in this case.

Device I/O with the proper accessors is sequential even if the process
moves to a different CPU. Is that what you're worried about? The comment
says "make sure it is written to sram", which should be achieved by
reading some register back from the device after all the writes so that
the driver knows the writes have been posted to the device. I believe
this mb() is doing nothing.
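
The read-back pattern described above can be modeled in userspace C.  The "registers" here are a plain array standing in for ioremapped MMIO; on real hardware, reading a device register back after a series of writes tells the driver the posted writes have reached the device, which a CPU-side mb() does not guarantee.

```c
#include <stdint.h>

static volatile uint32_t fake_regs[4];	/* stand-in for ioremapped MMIO */

static void dcc_writel(uint32_t val, unsigned int off)
{
	fake_regs[off] = val;
}

static uint32_t dcc_readl(unsigned int off)
{
	return fake_regs[off];
}

static uint32_t configure_and_flush(void)
{
	dcc_writel(0x1, 0);
	dcc_writel(0x2, 1);
	dcc_writel(0x3, 2);
	/* Read back the last register written: on real hardware this forces
	 * the posted writes ahead of it to complete at the device before
	 * the driver proceeds (e.g. before setting a trigger bit). */
	return dcc_readl(2);
}
```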

> 
> > 
> >> +
> >> +   drvdata->enable[list] = true;
> >> +
> >> +   /* 5. Configure trigger */
> >> +   dcc_writel(drvdata, BIT(9), DCC_LL_CFG(list));
> >> +   }
> >> +
> >> +err:
> >> +   mutex_unlock(&drvdata->mutex);
> >> +   return ret;
> >> +}
> >> +
> >> +static void dcc_disable(struct dcc_drvdata *drvdata)
> >> +{
> >> +   int curr_list;
> >> +
> >> +   mutex_lock(&drvdata->mutex);
> >> +
> >> +   if (!dcc_ready(drvdata))
> >> +   dev_err(drvdata->dev, "DCC is not ready Disabling DCC...\n");
> > 
> > Is that two sentences? And a debug print?
> 
> Ack.
> 
> > 
> >> +
> >> +   for (curr_list = 0; curr_list < drvdata->nr_link_list; 
> >> curr_list++) {
> >> +   if (!drvdata->enable[curr_list])
> >> +   continue;
> >> +   dcc_writel(drvdata, 0, DCC_LL_CFG(curr_list));
> >> +   dcc_writel(drvdata, 0, DCC_LL_BASE(curr_list));
> >> +   dcc_writel(drvdata, 0, DCC_FD_BASE(curr_list));
> >> +   dcc_writel(drvdata, 0, DCC_LL_LOCK(curr_list));
> >> +   drvdata->enable[curr_list] = false;
> >> +   }
> >> +   memset_io(drvdata->ram_base, 0, drvdata->ram_size);
> >> +   drvdata->ram_cfg = 0;
> >> +   drvdata->ram_start = 0;
> >> +   mutex_unlock(&drvdata->mutex);
> >> +}
> >> +
> >> +static ssize_t curr_list_show(struct device *dev,
> >> +   struct device_attribute *attr, char *buf)
> >> +{
> >> +   int ret;
> >> +   struct dcc_drvdata *drvdata = dev_get_drvdata(dev);
> >> +
> >> +   mutex_lock(&drvdata->mutex);
> >> +   if (drvdata->curr_list == DCC_INVALID_LINK_LIST) {
> >> +   dev_err(dev, "curr_list is not set.\n");
> >> +   ret = -EINVAL;
> >> +   goto err;
> >> +   }
> >> +
> >> +   ret = scnprintf(buf, PAGE_SIZE, "%d\n", drvdata->curr_list);
> >> +err:
> >> +   mutex_unlock(&drvdata->mutex);
> >> +   return ret;
> >> +}
> >> +
> >> +static ssize_t curr_list_store(struct device *dev,
> >> + 

Re: [PATCH 1/3] srcu: Remove superfluous ssp initialization on deferred work queue

2021-04-01 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 01:47:02AM +0200, Frederic Weisbecker wrote:
> When an ssp has already started a grace period and queued an early work
> to flush after SRCU workqueues are created, we expect the ssp to be
> properly initialized already. So we can skip this step at this stage.
> 
> Signed-off-by: Frederic Weisbecker 
> Cc: Boqun Feng 
> Cc: Lai Jiangshan 
> Cc: Neeraj Upadhyay 
> Cc: Josh Triplett 
> Cc: Joel Fernandes 
> Cc: Uladzislau Rezki 
> ---
>  kernel/rcu/srcutree.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 036ff5499ad5..7197156418e4 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1396,7 +1396,6 @@ void __init srcu_init(void)
>   while (!list_empty(&srcu_boot_list)) {
>   ssp = list_first_entry(&srcu_boot_list, struct srcu_struct,
> work.work.entry);
> - check_init_srcu_struct(ssp);

You lost me on this one.  What happens if the only pre-initialization
invocation on the statically allocated srcu_struct pointed to by ssp
was call_srcu()?  I am not seeing how the initialization has already
happened in that case.

What am I missing here?

Thanx, Paul

>   list_del_init(&ssp->work.work.entry);
>   queue_work(rcu_gp_wq, &ssp->work.work);
>   }
> -- 
> 2.25.1
> 


[PATCH v3 4/4] dt-bindings: serial: 8250: add aspeed,lpc-address and aspeed,sirq

2021-04-01 Thread Zev Weiss
These correspond to the existing lpc_address, sirq, and sirq_polarity
sysfs attributes; the second element of aspeed,sirq provides a
replacement for the deprecated aspeed,sirq-polarity-sense property.

Signed-off-by: Zev Weiss 
---
 .../devicetree/bindings/serial/8250.yaml  | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/serial/8250.yaml b/Documentation/devicetree/bindings/serial/8250.yaml
index 491b9297432d..a6e01f9b745f 100644
--- a/Documentation/devicetree/bindings/serial/8250.yaml
+++ b/Documentation/devicetree/bindings/serial/8250.yaml
@@ -12,8 +12,13 @@ maintainers:
 allOf:
   - $ref: /schemas/serial.yaml#
   - if:
-  required:
-- aspeed,sirq-polarity-sense
+  anyOf:
+- required:
+- aspeed,lpc-address
+- required:
+- aspeed,sirq
+- required:
+- aspeed,sirq-polarity-sense
 then:
   properties:
 compatible:
@@ -190,6 +195,20 @@ properties:
   applicable to aspeed,ast2500-vuart.
 deprecated: true
 
+  aspeed,lpc-address:
+$ref: '/schemas/types.yaml#/definitions/uint32'
+description: |
+  The VUART LPC address.  Only applicable to aspeed,ast2500-vuart.
+
+  aspeed,sirq:
+$ref: "/schemas/types.yaml#/definitions/uint32-array"
+minItems: 2
+maxItems: 2
+description: |
+  A 2-cell property describing the VUART SIRQ number and SIRQ
+  polarity (IRQ_TYPE_LEVEL_LOW or IRQ_TYPE_LEVEL_HIGH).  Only
+  applicable to aspeed,ast2500-vuart.
+
 required:
   - reg
   - interrupts
@@ -221,6 +240,7 @@ examples:
 };
   - |
 #include 
+#include 
 serial@1e787000 {
 compatible = "aspeed,ast2500-vuart";
 reg = <0x1e787000 0x40>;
@@ -228,7 +248,8 @@ examples:
 interrupts = <8>;
        clocks = <&syscon ASPEED_CLK_APB>;
 no-loopback-test;
-aspeed,sirq-polarity-sense = <&syscon 0x70 25>;
+aspeed,lpc-address = <0x3f8>;
+aspeed,sirq = <4 IRQ_TYPE_LEVEL_LOW>;
 };
 
 ...
-- 
2.31.1



[PATCH v3 3/4] drivers/tty/serial/8250: add aspeed,lpc-address and aspeed,sirq DT properties

2021-04-01 Thread Zev Weiss
These allow describing all the Aspeed VUART attributes currently
available via sysfs.  aspeed,sirq provides a replacement for the
deprecated aspeed,sirq-polarity-sense property.

Signed-off-by: Zev Weiss 
---
 drivers/tty/serial/8250/8250_aspeed_vuart.c | 44 -
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/8250/8250_aspeed_vuart.c b/drivers/tty/serial/8250/8250_aspeed_vuart.c
index 8433f8dbb186..10b1f33386e6 100644
--- a/drivers/tty/serial/8250/8250_aspeed_vuart.c
+++ b/drivers/tty/serial/8250/8250_aspeed_vuart.c
@@ -28,6 +28,10 @@
 #define ASPEED_VUART_ADDRL 0x28
 #define ASPEED_VUART_ADDRH 0x2c
 
+#define ASPEED_VUART_DEFAULT_LPC_ADDR  0x3f8
+#define ASPEED_VUART_DEFAULT_SIRQ  4
+#define ASPEED_VUART_DEFAULT_SIRQ_POLARITY IRQ_TYPE_LEVEL_LOW
+
 struct aspeed_vuart {
struct device   *dev;
void __iomem*regs;
@@ -393,7 +397,8 @@ static int aspeed_vuart_probe(struct platform_device *pdev)
struct aspeed_vuart *vuart;
struct device_node *np;
struct resource *res;
-   u32 clk, prop;
+   u32 clk, prop, sirq[2];
+   bool sirq_polarity;
int rc;
 
np = pdev->dev.of_node;
@@ -501,6 +506,43 @@ static int aspeed_vuart_probe(struct platform_device *pdev)
of_node_put(sirq_polarity_sense_args.np);
}
 
+   rc = of_property_read_u32(np, "aspeed,lpc-address", &prop);
+   if (rc < 0)
+   prop = ASPEED_VUART_DEFAULT_LPC_ADDR;
+
+   rc = aspeed_vuart_set_lpc_address(vuart, prop);
+   if (rc < 0) {
+   dev_err(&pdev->dev, "invalid value in aspeed,lpc-address property\n");
+   goto err_clk_disable;
+   }
+
+   rc = of_property_read_u32_array(np, "aspeed,sirq", sirq, 2);
+   if (rc < 0) {
+   sirq[0] = ASPEED_VUART_DEFAULT_SIRQ;
+   sirq[1] = ASPEED_VUART_DEFAULT_SIRQ_POLARITY;
+   }
+
+   rc = aspeed_vuart_set_sirq(vuart, sirq[0]);
+   if (rc < 0) {
+   dev_err(&pdev->dev, "invalid sirq number in aspeed,sirq property\n");
+   goto err_clk_disable;
+   }
+
+   switch (sirq[1]) {
+   case IRQ_TYPE_LEVEL_LOW:
+   sirq_polarity = false;
+   break;
+   case IRQ_TYPE_LEVEL_HIGH:
+   sirq_polarity = true;
+   break;
+   default:
+   dev_err(&pdev->dev, "invalid sirq polarity in aspeed,sirq property\n");
+   rc = -EINVAL;
+   goto err_clk_disable;
+   }
+
+   aspeed_vuart_set_sirq_polarity(vuart, sirq_polarity);
+
aspeed_vuart_set_enabled(vuart, true);
aspeed_vuart_set_host_tx_discard(vuart, true);
platform_set_drvdata(pdev, vuart);
-- 
2.31.1



[PATCH v3 2/4] drivers/tty/serial/8250: refactor sirq and lpc address setting code

2021-04-01 Thread Zev Weiss
This splits dedicated aspeed_vuart_set_{sirq,lpc_address}() functions
out of the sysfs store functions in preparation for adding DT
properties that will be poking the same registers.  While we're at it,
these functions now provide some basic bounds-checking on their
arguments.
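
The bounds check this refactor adds derives the maximum legal SIRQ value from the register field itself.  A userspace model of the pattern (the mask and shift values here are illustrative, not the AST2500's actual register layout):

```c
#include <stdint.h>

/* Illustrative field definitions; the driver derives the maximum legal
 * value from the mask/shift pair rather than hard-coding it. */
#define HOST_SIRQ_SHIFT	4
#define HOST_SIRQ_MASK	(0xfu << HOST_SIRQ_SHIFT)

static int set_sirq(uint8_t *reg, uint32_t sirq)
{
	/* Reject anything that would not fit in the register field. */
	if (sirq > (HOST_SIRQ_MASK >> HOST_SIRQ_SHIFT))
		return -1;	/* the driver returns -EINVAL here */

	/* Read-modify-write: clear the field, then insert the new value. */
	*reg = (*reg & (uint8_t)~HOST_SIRQ_MASK) |
	       (uint8_t)(sirq << HOST_SIRQ_SHIFT);
	return 0;
}
```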

Signed-off-by: Zev Weiss 
---
 drivers/tty/serial/8250/8250_aspeed_vuart.c | 51 ++---
 1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/drivers/tty/serial/8250/8250_aspeed_vuart.c b/drivers/tty/serial/8250/8250_aspeed_vuart.c
index c33e02cbde93..8433f8dbb186 100644
--- a/drivers/tty/serial/8250/8250_aspeed_vuart.c
+++ b/drivers/tty/serial/8250/8250_aspeed_vuart.c
@@ -72,22 +72,31 @@ static ssize_t lpc_address_show(struct device *dev,
return snprintf(buf, PAGE_SIZE - 1, "0x%x\n", addr);
 }
 
+static int aspeed_vuart_set_lpc_address(struct aspeed_vuart *vuart, u32 addr)
+{
+   if (addr > U16_MAX)
+   return -EINVAL;
+
+   writeb(addr >> 8, vuart->regs + ASPEED_VUART_ADDRH);
+   writeb(addr >> 0, vuart->regs + ASPEED_VUART_ADDRL);
+
+   return 0;
+}
+
 static ssize_t lpc_address_store(struct device *dev,
 struct device_attribute *attr,
 const char *buf, size_t count)
 {
struct aspeed_vuart *vuart = dev_get_drvdata(dev);
-   unsigned long val;
+   u32 val;
int err;
 
-   err = kstrtoul(buf, 0, &val);
+   err = kstrtou32(buf, 0, &val);
if (err)
return err;
 
-   writeb(val >> 8, vuart->regs + ASPEED_VUART_ADDRH);
-   writeb(val >> 0, vuart->regs + ASPEED_VUART_ADDRL);
-
-   return count;
+   err = aspeed_vuart_set_lpc_address(vuart, val);
+   return err ? : count;
 }
 
 static DEVICE_ATTR_RW(lpc_address);
@@ -105,27 +114,37 @@ static ssize_t sirq_show(struct device *dev,
return snprintf(buf, PAGE_SIZE - 1, "%u\n", reg);
 }
 
+static int aspeed_vuart_set_sirq(struct aspeed_vuart *vuart, u32 sirq)
+{
+   u8 reg;
+
+   if (sirq > (ASPEED_VUART_GCRB_HOST_SIRQ_MASK >> ASPEED_VUART_GCRB_HOST_SIRQ_SHIFT))
+   return -EINVAL;
+
+   sirq <<= ASPEED_VUART_GCRB_HOST_SIRQ_SHIFT;
+   sirq &= ASPEED_VUART_GCRB_HOST_SIRQ_MASK;
+
+   reg = readb(vuart->regs + ASPEED_VUART_GCRB);
+   reg &= ~ASPEED_VUART_GCRB_HOST_SIRQ_MASK;
+   reg |= sirq;
+   writeb(reg, vuart->regs + ASPEED_VUART_GCRB);
+
+   return 0;
+}
+
 static ssize_t sirq_store(struct device *dev, struct device_attribute *attr,
  const char *buf, size_t count)
 {
struct aspeed_vuart *vuart = dev_get_drvdata(dev);
unsigned long val;
int err;
-   u8 reg;
 
	err = kstrtoul(buf, 0, &val);
if (err)
return err;
 
-   val <<= ASPEED_VUART_GCRB_HOST_SIRQ_SHIFT;
-   val &= ASPEED_VUART_GCRB_HOST_SIRQ_MASK;
-
-   reg = readb(vuart->regs + ASPEED_VUART_GCRB);
-   reg &= ~ASPEED_VUART_GCRB_HOST_SIRQ_MASK;
-   reg |= val;
-   writeb(reg, vuart->regs + ASPEED_VUART_GCRB);
-
-   return count;
+   err = aspeed_vuart_set_sirq(vuart, val);
+   return err ? : count;
 }
 
 static DEVICE_ATTR_RW(sirq);
-- 
2.31.1



[PATCH v3 1/4] dt-bindings: serial: 8250: deprecate aspeed,sirq-polarity-sense

2021-04-01 Thread Zev Weiss
This property ties SIRQ polarity to SCU register bits that don't
necessarily have any direct relationship to it; the only use of it
was removed in commit c82bf6e133d30e0f9172a20807814fa28aef0f67.

Signed-off-by: Zev Weiss 
Reviewed-by: Joel Stanley 
---
 Documentation/devicetree/bindings/serial/8250.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/serial/8250.yaml b/Documentation/devicetree/bindings/serial/8250.yaml
index f54cae9ff7b2..491b9297432d 100644
--- a/Documentation/devicetree/bindings/serial/8250.yaml
+++ b/Documentation/devicetree/bindings/serial/8250.yaml
@@ -188,6 +188,7 @@ properties:
   offset and bit number to identify how the SIRQ polarity should be
   configured. One possible data source is the LPC/eSPI mode bit. Only
   applicable to aspeed,ast2500-vuart.
+deprecated: true
 
 required:
   - reg
-- 
2.31.1



[PATCH v3 0/4] aspeed-vuart: generalized DT properties

2021-04-01 Thread Zev Weiss
This series generalizes the aspeed-vuart driver's device tree
properties to cover all the attributes it currently exposes via sysfs.

The aspeed,sirq-polarity-sense property was a bit of a design mistake
in that it ties Aspeed VUART SIRQ polarity to SCU register bits that
aren't really inherently related to it; the first patch in this series
deprecates it (though we hope to eventually remove it).

The rest of the series adds two new properties, aspeed,lpc-address and
aspeed,sirq.  The latter allows describing the SIRQ polarity (along
with the interrupt number) directly, providing a simpler replacement
for aspeed,sirq-polarity-sense.


Changes since v2 [0]:
 - expanded to also handle sirq number and lpc address in addition to
   sirq polarity
 - added default settings if DT properties not specified
 - refactored existing sysfs code slightly, adding range checks
 - cleaned up 'make dt_binding_check' warnings

Changes since v1 [1]:
 - deprecate and retain aspeed,sirq-polarity-sense instead of removing it
 - drop e3c246d4i dts addition from this series

[0] https://lore.kernel.org/openbmc/20210401005702.28271-1-...@bewilderbeest.net/
[1] https://lore.kernel.org/openbmc/20210330002338.335-1-...@bewilderbeest.net/

Zev Weiss (4):
  dt-bindings: serial: 8250: deprecate aspeed,sirq-polarity-sense
  drivers/tty/serial/8250: refactor sirq and lpc address setting code
  drivers/tty/serial/8250: add aspeed,lpc-address and aspeed,sirq DT
properties
  dt-bindings: serial: 8250: add aspeed,lpc-address and aspeed,sirq

 .../devicetree/bindings/serial/8250.yaml  | 28 +-
 drivers/tty/serial/8250/8250_aspeed_vuart.c   | 95 +++
 2 files changed, 103 insertions(+), 20 deletions(-)

-- 
2.31.1



[PATCH v7 1/2] Added AMS tsl2591 driver implementation

2021-04-01 Thread Joe Sandom
Driver implementation for AMS/TAOS tsl2591 ambient light sensor.

This driver supports configuration via device tree and sysfs.
Supported channels for raw infrared light intensity,
raw combined light intensity and illuminance in lux.
The driver additionally supports iio events on lower and
upper thresholds.

This is a very-high sensitivity light-to-digital converter that
transforms light intensity into a digital signal.

Datasheet: https://ams.com/tsl25911#tab/documents

Signed-off-by: Joe Sandom 
---

Changes in v7:
- Revert back to using plain numbers in register defines instead of BIT and GENMASK
- Changed from pre-increment of 'i' in for loops to post-increment
- Use the iopoll.h readx_poll_timeout macro for the periodic poll on the ALS status valid flag
- Remove the 0xFF masks on u8 types as they are redundant

 drivers/iio/light/Kconfig   |   11 +
 drivers/iio/light/Makefile  |1 +
 drivers/iio/light/tsl2591.c | 1224 +++
 3 files changed, 1236 insertions(+)
 create mode 100644 drivers/iio/light/tsl2591.c

diff --git a/drivers/iio/light/Kconfig b/drivers/iio/light/Kconfig
index 33ad4dd0b5c7..6a69a9a3577a 100644
--- a/drivers/iio/light/Kconfig
+++ b/drivers/iio/light/Kconfig
@@ -501,6 +501,17 @@ config TSL2583
  Provides support for the TAOS tsl2580, tsl2581 and tsl2583 devices.
  Access ALS data via iio, sysfs.
 
+config TSL2591
+	tristate "TAOS TSL2591 ambient light sensor"
+	depends on I2C
+	help
+	  Select Y here for support of the AMS/TAOS TSL2591 ambient light sensor,
+	  featuring channels for combined visible + IR intensity and lux illuminance.
+	  Access data via iio and sysfs. Supports iio_events.
+
+	  To compile this driver as a module, select M: the
+	  module will be called tsl2591.
+
 config TSL2772
	tristate "TAOS TSL/TMD2x71 and TSL/TMD2x72 Family of light and proximity sensors"
depends on I2C
diff --git a/drivers/iio/light/Makefile b/drivers/iio/light/Makefile
index ea376deaca54..d10912faf964 100644
--- a/drivers/iio/light/Makefile
+++ b/drivers/iio/light/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_ST_UVIS25_SPI)   += st_uvis25_spi.o
 obj-$(CONFIG_TCS3414)  += tcs3414.o
 obj-$(CONFIG_TCS3472)  += tcs3472.o
 obj-$(CONFIG_TSL2583)  += tsl2583.o
+obj-$(CONFIG_TSL2591)  += tsl2591.o
 obj-$(CONFIG_TSL2772)  += tsl2772.o
 obj-$(CONFIG_TSL4531)  += tsl4531.o
 obj-$(CONFIG_US5182D)  += us5182d.o
diff --git a/drivers/iio/light/tsl2591.c b/drivers/iio/light/tsl2591.c
new file mode 100644
index ..14a405a96e3a
--- /dev/null
+++ b/drivers/iio/light/tsl2591.c
@@ -0,0 +1,1224 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2020 Joe Sandom 
+ *
+ * Datasheet: https://ams.com/tsl25911#tab/documents
+ *
+ * Device driver for the TAOS TSL2591. This is a very-high sensitivity
+ * light-to-digital converter that transforms light intensity into a digital
+ * signal.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+/* ADC integration time, field value to time in ms*/
+#define TSL2591_FVAL_TO_ATIME(x) (((x) + 1) * 100)
+/* ADC integration time, time in ms to field value */
+#define TSL2591_ATIME_TO_FVAL(x) (((x) / 100) - 1)
+
+/* TSL2591 register set */
+#define TSL2591_ENABLE  0x00
+#define TSL2591_CONTROL 0x01
+#define TSL2591_AILTL   0x04
+#define TSL2591_AILTH   0x05
+#define TSL2591_AIHTL   0x06
+#define TSL2591_AIHTH   0x07
+#define TSL2591_NP_AILTL0x08
+#define TSL2591_NP_AILTH0x09
+#define TSL2591_NP_AIHTL0x0A
+#define TSL2591_NP_AIHTH0x0B
+#define TSL2591_PERSIST 0x0C
+#define TSL2591_PACKAGE_ID  0x11
+#define TSL2591_DEVICE_ID   0x12
+#define TSL2591_STATUS  0x13
+#define TSL2591_C0_DATAL0x14
+#define TSL2591_C0_DATAH0x15
+#define TSL2591_C1_DATAL0x16
+#define TSL2591_C1_DATAH0x17
+
+/* TSL2591 command register definitions */
+#define TSL2591_CMD_NOP 0xA0
+#define TSL2591_CMD_SF_INTSET   0xE4
+#define TSL2591_CMD_SF_CALS_I   0xE5
+#define TSL2591_CMD_SF_CALS_NPI 0xE7
+#define TSL2591_CMD_SF_CNP_ALSI 0xEA
+
+/* TSL2591 enable register definitions */
+#define TSL2591_PWR_ON  0x01
+#define TSL2591_PWR_OFF 0x00
+#define TSL2591_ENABLE_ALS  0x02
+#define TSL2591_ENABLE_ALS_INT  0x10
+#define TSL2591_ENABLE_SLEEP_INT0x40
+#define TSL2591_ENABLE_NP_INT   0x80
+
+/* TSL2591 control register definitions */
+#define TSL2591_CTRL_ALS_INTEGRATION_100MS  0x00
+#define TSL2591_CTRL_ALS_INTEGRATION_200MS  0x01
+#define TSL2591_CTRL_ALS_INTEGRATION_300MS  0x02
+#define TSL2591_CTRL_ALS_INTEGRATION_400MS  0x03
+#define TSL2591_CTRL_ALS_INTEGRATION_500MS  0x04
+#define TSL2591_CTRL_ALS_INTEGRATION_600MS  0x05
+#define TSL2591_CTRL_ALS_LOW_GAIN   0x00
+#define 

[PATCH v7 2/2] Added AMS tsl2591 device tree binding

2021-04-01 Thread Joe Sandom
Device tree binding for AMS/TAOS tsl2591 ambient light sensor.

This driver supports configuration via device tree and sysfs.
Supported channels for raw infrared light intensity,
raw combined light intensity and illuminance in lux.
The driver additionally supports iio events on lower and
upper thresholds.

This is a very-high sensitivity light-to-digital converter that
transforms light intensity into a digital signal.

Signed-off-by: Joe Sandom 
Reviewed-by: Rob Herring 
---
Changes in v7:
- No changes

Notes:
- Re-submitted to align the version with part 1 of the patch series

 .../bindings/iio/light/amstaos,tsl2591.yaml   | 50 +++
 1 file changed, 50 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iio/light/amstaos,tsl2591.yaml

diff --git a/Documentation/devicetree/bindings/iio/light/amstaos,tsl2591.yaml b/Documentation/devicetree/bindings/iio/light/amstaos,tsl2591.yaml
new file mode 100644
index ..596a3bc770f4
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/light/amstaos,tsl2591.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iio/light/amstaos,tsl2591.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: AMS/TAOS TSL2591 Ambient Light Sensor (ALS)
+
+maintainers:
+  - Joe Sandom 
+
+description: |
+  AMS/TAOS TSL2591 is a very-high sensitivity
+  light-to-digital converter that transforms light intensity into a digital
+  signal.
+
+properties:
+  compatible:
+const: amstaos,tsl2591
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+description:
+  Interrupt (INT:Pin 2) Active low. Should be set to IRQ_TYPE_EDGE_FALLING.
+  interrupt is used to detect if the light intensity has fallen below
+  or reached above the configured threshold values.
+
+required:
+  - compatible
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    #include 
+    i2c {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        tsl2591@29 {
+            compatible = "amstaos,tsl2591";
+            reg = <0x29>;
+            interrupts = <20 IRQ_TYPE_EDGE_FALLING>;
+        };
+    };
+...
-- 
2.17.1



Re: [PATCH v6 1/2] Added AMS tsl2591 driver implementation

2021-04-01 Thread Joe Sandom
On Fri, Mar 26, 2021 at 01:01:57PM +0200, Andy Shevchenko wrote:
> On Fri, Mar 26, 2021 at 12:05 AM Joe Sandom  wrote:
> >
> > Driver implementation for AMS/TAOS tsl2591 ambient light sensor.
> >
> > This driver supports configuration via device tree and sysfs.
> > Supported channels for raw infrared light intensity,
> > raw combined light intensity and illuminance in lux.
> > The driver additionally supports iio events on lower and
> > upper thresholds.
> >
> > This is a very-high sensitivity light-to-digital converter that
> > transforms light intensity into a digital signal.
> 
> I'm under the impression that you ignored at least half of my comments

The majority of your comments were applied in V5 as far as I can see.
Some of them I recognised as optional at the time. I had another sweep
through and have seen value in enforcing a few of the other points you
mentioned. I've added them to V7 and will release shortly. Thanks for
the feedback Andy.

> [1]. Have you seen them?
> 
> [1]: 
> https://lore.kernel.org/linux-iio/cahp75vcsw2xxdh--rxan7xt0ju+qfw9c_va0ggrgpgpbua0...@mail.gmail.com/
> 
> Please. address and come again.
> NAK for this version, sorry.
> 
> -- 
> With Best Regards,
> Andy Shevchenko


Re: [syzbot] WARNING in bpf_test_run

2021-04-01 Thread Yonghong Song

On 4/1/21 3:05 PM, Yonghong Song wrote:

On 4/1/21 4:29 AM, syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:    36e79851 libbpf: Preserve empty DATASEC BTFs during static..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1569bb06d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=7eff0f22b8563a5f
dashboard link: https://syzkaller.appspot.com/bug?extid=774c590240616eaa3423
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17556b7cd0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1772be26d0


The issue was bisected to:

commit 997acaf6b4b59c6a9c259740312a69ea549cc684
Author: Mark Rutland 
Date:   Mon Jan 11 15:37:07 2021 +

 lockdep: report broken irq restoration

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10197016d0
final oops:     https://syzkaller.appspot.com/x/report.txt?x=12197016d0
console output: https://syzkaller.appspot.com/x/log.txt?x=14197016d0

IMPORTANT: if you fix the issue, please add the following tag to the 
commit:

Reported-by: syzbot+774c590240616eaa3...@syzkaller.appspotmail.com
Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")

[ cut here ]
WARNING: CPU: 0 PID: 8725 at include/linux/bpf-cgroup.h:193 bpf_cgroup_storage_set include/linux/bpf-cgroup.h:193 [inline]
WARNING: CPU: 0 PID: 8725 at include/linux/bpf-cgroup.h:193 bpf_test_run+0x65e/0xaa0 net/bpf/test_run.c:109


I will look at this issue. Thanks!


Modules linked in:
CPU: 0 PID: 8725 Comm: syz-executor927 Not tainted 5.12.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

RIP: 0010:bpf_cgroup_storage_set include/linux/bpf-cgroup.h:193 [inline]
RIP: 0010:bpf_test_run+0x65e/0xaa0 net/bpf/test_run.c:109
Code: e9 29 fe ff ff e8 b2 9d 3a fa 41 83 c6 01 bf 08 00 00 00 44 89 f6 e8 51 a5 3a fa 41 83 fe 08 0f 85 74 fc ff ff e8 92 9d 3a fa <0f> 0b bd f0  ff e9 5c fd ff ff e8 81 9d 3a fa 83 c5 01 bf 08

RSP: 0018:c900017bfaf0 EFLAGS: 00010293
RAX:  RBX: c9f29000 RCX: 
RDX: 88801bc68000 RSI: 8739543e RDI: 0003
RBP: 0007 R08: 0008 R09: 0001
R10: 8739542f R11:  R12: dc00
R13: 888021dd54c0 R14: 0008 R15: 
FS:  7f00157d7700() GS:8880b9c0() knlGS:

CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f0015795718 CR3: 157ae000 CR4: 001506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  bpf_prog_test_run_skb+0xabc/0x1c70 net/bpf/test_run.c:628
  bpf_prog_test_run kernel/bpf/syscall.c:3132 [inline]
  __do_sys_bpf+0x218b/0x4f40 kernel/bpf/syscall.c:4411
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46


I ran the C reproducer on my qemu (4 cpus) and cannot reproduce the
result. It has already run for 30 minutes and is still running. Checking
the code, it is just doing a lot of parallel bpf_prog_test_run's.


The failure is in the below WARN_ON_ONCE code:

175 static inline int bpf_cgroup_storage_set(struct bpf_cgroup_storage
176 		*storage[MAX_BPF_CGROUP_STORAGE_TYPE])
177 {
178 	enum bpf_cgroup_storage_type stype;
179 	int i, err = 0;
180
181 	preempt_disable();
182 	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
183 		if (unlikely(this_cpu_read(bpf_cgroup_storage_info[i].task) != NULL))
184 			continue;
185
186 		this_cpu_write(bpf_cgroup_storage_info[i].task, current);
187 		for_each_cgroup_storage_type(stype)
188 			this_cpu_write(bpf_cgroup_storage_info[i].storage[stype],
189 				       storage[stype]);
190 		goto out;
191 	}
192 	err = -EBUSY;
193 	WARN_ON_ONCE(1);
194
195 out:
196 	preempt_enable();
197 	return err;
198 }

Basically it shows that the stress test triggered the warning due to a
limited kernel resource: all BPF_CGROUP_STORAGE_NEST_MAX slots were in use.


  entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x446199
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0  73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48

RSP: 002b:7f00157d72f8 EFLAGS: 0246 ORIG_RAX: 0141
RAX: ffda RBX: 004cb440 RCX: 00446199
RDX: 0028 RSI: 2080 RDI: 000a
RBP: 0049b074 R08:  R09: 
R10:  R11: 0246 R12: f9abde7200f522cd
R13: 3952ddf3af240c07 R14: 1631e0d82d3fa99d R15: 004cb448


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.

[PATCH] f2fs: set checkpoint_merge by default

2021-04-01 Thread Jaegeuk Kim
Since introducing checkpoint_merge, we've seen some contention when the
option is not enabled. To avoid that, let's set it by default.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 14239e2b7ae7..c15800c3cdb1 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1839,6 +1839,7 @@ static void default_options(struct f2fs_sb_info *sbi)
set_opt(sbi, EXTENT_CACHE);
set_opt(sbi, NOHEAP);
clear_opt(sbi, DISABLE_CHECKPOINT);
+   set_opt(sbi, MERGE_CHECKPOINT);
F2FS_OPTION(sbi).unusable_cap = 0;
sbi->sb->s_flags |= SB_LAZYTIME;
set_opt(sbi, FLUSH_MERGE);
-- 
2.31.0.208.g409f899ff0-goog



Re: [PATCH v2] powerpc/traps: Enhance readability for trap types

2021-04-01 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of April 2, 2021 2:11 am:
> On Thu, Apr 01, 2021 at 10:55:58AM +0800, Xiongwei Song wrote:
>> Segher Boessenkool  于2021年4月1日周四 上午6:15写道:
>> 
>> > On Wed, Mar 31, 2021 at 08:58:17PM +1100, Michael Ellerman wrote:
>> > > So perhaps:
>> > >
>> > >   EXC_SYSTEM_RESET
>> > >   EXC_MACHINE_CHECK
>> > >   EXC_DATA_STORAGE
>> > >   EXC_DATA_SEGMENT
>> > >   EXC_INST_STORAGE
>> > >   EXC_INST_SEGMENT
>> > >   EXC_EXTERNAL_INTERRUPT
>> > >   EXC_ALIGNMENT
>> > >   EXC_PROGRAM_CHECK
>> > >   EXC_FP_UNAVAILABLE
>> > >   EXC_DECREMENTER
>> > >   EXC_HV_DECREMENTER
>> > >   EXC_SYSTEM_CALL
>> > >   EXC_HV_DATA_STORAGE
>> > >   EXC_PERF_MONITOR
>> >
>> > These are interrupt (vectors), not exceptions.  It doesn't matter all
>> > that much, but confusing things more isn't useful either!  There can be
>> > multiple exceptions that all can trigger the same interrupt.
>> >
>> When looking at the reference manuals for the e500 and e600 from NXP,
>> they call them interrupts. While "The Programming Environments"
>> document, which is also from NXP, calls them exceptions. It looks like
>> there is no explicit distinction between interrupts and exceptions.
> 
> The architecture documents have always called it interrupts.  The PEM
> says it calls them exceptions instead, but they are called interrupts in
> the architecture (and the PEM says that, too).
> 
>> Here is the "The Programming Environments" link:
>> https://www.nxp.com.cn/docs/en/user-guide/MPCFPE_AD_R1.pdf
> 
> That document is 24 years old.  The architecture is still published,
> new versions regularly.
> 
>> As far as I know, the values of interrupts or exceptions above are defined
>> explicitly in reference manual or the programming environments.
> 
> They are defined in the architecture.
> 
>> Could
>> you please provide more details about multiple exceptions with the same
>> interrupts?
> 
> The simplest example is 700, program interrupt.  There are many causes
> for it, including all the exceptions in FPSCR: VX, ZX, OX, UX, XX, and
> VX is actually divided into nine separate cases itself.  There also are
> the various causes of privileged instruction type program interrupts,
> and  the trap type program interrupt, but the FEX ones are most obvious
> here.

Also:

* Some interrupts have no corresponding exception (system call and 
system call vectored). This is not just semantics or a bug in the ISA
because it is different from other synchronous interrupts: instructions 
which cause exceptions (e.g., a page fault) do not complete before 
taking the interrupt whereas sc does.

* It's quite usual for an exception to not cause an interrupt 
immediately (MSR[EE]=0, HMEER) or never cause one and be cleared by 
other means (msgclr, mtDEC, mtHMER, etc).

* It's possible for an exception to cause different interrupts!
A decrementer exception usually causes a decrementer interrupt, but it
can cause a system reset interrupt if the processor was in a power
saving mode. A data storage exception can cause a DSI or HDSI interrupt
depending on LPCR settings, and many other examples.

So I agree with Segher on this. We should use interrupt for interrupts, 
reduce exception except where we really mean it, and move away from vec 
and trap (I've got this wrong in the past too I admit). We don't have to 
do it all immediately, but new code should go in this direction.

Thanks,
Nick


Re: [PATCH 7/9] sched: Cgroup core-scheduling interface

2021-04-01 Thread Josh Don
Thanks, allowing for multiple group cookies in a hierarchy is a nice
improvement.

> +   if (tgi != tg) {
> +   if (tgi->core_cookie || (tgi->core_parent && 
> tgi->core_parent != tg))
> +   continue;
> +
> +   tgi->core_parent = parent;
> +   tgi->core_cookie = 0;

core_cookie must already be 0, given the check above.


Re: [PATCH v6 00/12] lib/find_bit: fast path for small bitmaps

2021-04-01 Thread Andrew Morton
On Thu, 1 Apr 2021 12:50:31 +0300 Andy Shevchenko wrote:

> > I normally don't have a lot of material for asm-generic either, half
> > the time there are no pull requests at all for a given release. I would
> > expect future changes to the bitmap implementation to only need
> > an occasional bugfix, which could go through either the asm-generic
> > tree or through mm and doesn't need another separate pull request.
> >
> > If it turns out to be a tree that needs regular updates every time,
> > then having a top level repository in linux-next would be appropriate.
> 
> Agree. asm-generic may serve for this. My worries are solely about how
> much burden we add on Andrew's shoulders.

Is fine.  Saving other developers from having to maintain tiny trees is
a thing I do.


