Re: [PATCH] usb: dwc3: Fix the USB 3.0 hub detection bug after warm boot

2017-07-21 Thread gustavo panizzo

Hello,

On Thu, Jul 13, 2017 at 01:58:26PM +0800, Baolin Wang wrote:

Hi,

On 13 July 2017 at 07:20, gustavo panizzo  wrote:

Hello Wang

thanks for your response


On Wed, Jul 12, 2017 at 02:08:04PM +0800, Baolin Wang wrote:


Hi,

On 12 July 2017 at 11:52, gustavo panizzo  wrote:


The dwc3 driver could not release resources when the module is built in
because the module does not have a shutdown method. As a result, the USB
3.0 hub is not detected after a warm boot.

Original patch by Brian Kim, updated and submitted upstream by gustavo
panizzo.

Also see https://bugs.debian.org/843448

Signed-off-by: Brian Kim 
Signed-off-by: gustavo panizzo 
---
 drivers/usb/dwc3/core.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 326b302fc440..f92dfe213d89 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1259,6 +1259,38 @@ static int dwc3_probe(struct platform_device *pdev)
return ret;
 }

+static void dwc3_shutdown(struct platform_device *pdev)
+{
+   struct dwc3 *dwc = platform_get_drvdata(pdev);
+   struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+
+   pm_runtime_get_sync(&pdev->dev);
+   /*
+    * restore res->start back to its original value so that, in case the
+    * probe is deferred, we don't end up getting error in request the
+    * memory region the next time probe is called.
+    */
+   res->start -= DWC3_GLOBALS_REGS_START;
+
+   dwc3_debugfs_exit(dwc);
+   dwc3_core_exit_mode(dwc);
+   dwc3_event_buffers_cleanup(dwc);



What about dwc3_event_buffers_cleanup()? Should I remove it from
dwc3_shutdown()? It is already called in dwc3_core_exit().


I think so. We should avoid duplicate code.


+   dwc3_free_event_buffers(dwc);
+
+   usb_phy_set_suspend(dwc->usb2_phy, 1);
+   usb_phy_set_suspend(dwc->usb3_phy, 1);
+
+   phy_power_off(dwc->usb2_generic_phy);
+   phy_power_off(dwc->usb3_generic_phy);



We've done these in dwc3_core_exit().
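
For illustration only (not the submitted patch): a minimal sketch of the
shutdown path once the duplicated cleanup is dropped, assuming
dwc3_core_exit() already cleans up the event buffers and suspends/powers
off the PHYs, as noted above:

static void dwc3_shutdown(struct platform_device *pdev)
{
	struct dwc3 *dwc = platform_get_drvdata(pdev);

	pm_runtime_get_sync(&pdev->dev);

	dwc3_debugfs_exit(dwc);
	dwc3_core_exit_mode(dwc);

	/* dwc3_core_exit() already handles event buffer cleanup and the
	 * PHY suspend/power-off, so they are not repeated here. */
	dwc3_core_exit(dwc);
	dwc3_free_event_buffers(dwc);
	dwc3_ulpi_exit(dwc);

	pm_runtime_put_sync(&pdev->dev);
	pm_runtime_disable(&pdev->dev);
}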


This is the patch after testing on an Odroid XU4, on top of linux-next
964bcc1b4f57028d56dace7d9bc5924f2eb43f36, which translates to 4.13.0-rc1-next-20170717+.

I tested this patch for a week under heavy USB and NIC usage without problems.
Please consider merging it.


Author: gustavo panizzo 
Date:   Wed Jul 12 11:26:55 2017 +0800

    usb: dwc3: Fix the USB 3.0 hub detection bug after warm boot

    The dwc3 driver could not release resources when the module is built in
    because the module does not have a shutdown method. As a result, the USB
    3.0 hub is not detected after a warm boot.

    Original patch by Brian Kim, updated and submitted upstream by gustavo
    panizzo.

    Also see https://bugs.debian.org/843448

    Signed-off-by: Brian Kim 
    Signed-off-by: gustavo panizzo 

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 326b302fc440..09de37d47ee7 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1259,6 +1259,32 @@ static int dwc3_probe(struct platform_device *pdev)
return ret;
}

+static void dwc3_shutdown(struct platform_device *pdev)
+{
+   struct dwc3 *dwc = platform_get_drvdata(pdev);
+   struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+
+   pm_runtime_get_sync(&pdev->dev);
+   /*
+    * restore res->start back to its original value so that, in case the
+    * probe is deferred, we don't end up getting error in request the
+    * memory region the next time probe is called.
+    */
+   res->start -= DWC3_GLOBALS_REGS_START;
+
+   dwc3_debugfs_exit(dwc);
+   dwc3_core_exit_mode(dwc);
+   dwc3_event_buffers_cleanup(dwc);
+   dwc3_free_event_buffers(dwc);
+
+   dwc3_core_exit(dwc);
+   dwc3_ulpi_exit(dwc);
+
+   pm_runtime_put_sync(&pdev->dev);
+   pm_runtime_allow(&pdev->dev);
+   pm_runtime_disable(&pdev->dev);
+}
+
static int dwc3_remove(struct platform_device *pdev)
{
struct dwc3 *dwc = platform_get_drvdata(pdev);
@@ -1488,6 +1514,7 @@ MODULE_DEVICE_TABLE(acpi, dwc3_acpi_match);
static struct platform_driver dwc3_driver = {
.probe  = dwc3_probe,
.remove = dwc3_remove,
+   .shutdown   = dwc3_shutdown,
.driver = {
.name   = "dwc3",
.of_match_table = of_match_ptr(of_dwc3_match),





[RFC PATCH] exec: Avoid recursive modprobe for binary format handlers

2017-07-21 Thread Matt Redfearn
When the kernel does not have a binary format handler for an executable
it is attempting to load, and CONFIG_MODULES is enabled, it will attempt
to load a module for that format. If the kernel does not have a binary
format handler for the modprobe executable either, this will trigger
another module load. Previously this recursive module loading was caught
and an error message printed informing the user that the executable
could not be executed:

request_module: runaway loop modprobe binfmt-464c
Starting init:/sbin/init exists but couldn't execute it (error -8)

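(Aside: the "binfmt-464c" in the log above comes straight from the
executable's header: search_binary_handler() requests "binfmt-%04x"
built from the two bytes at offset 2 of the file. A small self-contained
demo of that arithmetic, assuming a little-endian machine:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* First four bytes of an ELF executable: 0x7f 'E' 'L' 'F'. */
	unsigned char buf[4] = { 0x7f, 'E', 'L', 'F' };
	unsigned short magic;

	/* The same load the kernel does: *(ushort *)(bprm->buf + 2). */
	memcpy(&magic, buf + 2, sizeof(magic));

	/* On little-endian this prints "binfmt-464c", matching the log. */
	printf("binfmt-%04x\n", magic);
	return 0;
}
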
Commit 6d7964a722af ("kmod: throttle kmod thread limit") which was
merged in v4.13-rc1 broke this behaviour: the recursive modprobe is
no longer caught and just ends up waiting indefinitely on the kmod_wq
wait queue. Hence the kernel appears to hang silently when starting
userspace.

This problem was observed when the binfmt handler for MIPS o32 binaries
is not built into a 64-bit kernel and the root filesystem is o32 ABI.

Catch this by adding a guard to search_binary_handler(). If there is no
binary format handler available to load an executable, and the
executable matches modprobe_path, i.e. the userspace helper that would
be executed to load a module, then do not attempt to load the module
since it will just end up here again when it fails to execute. This
actually improves the original behaviour since the "runaway loop"
warning is no longer printed, and we simply get:

Starting init:/sbin/init exists but couldn't execute it (error -8)

Fixes: 6d7964a722af ("kmod: throttle kmod thread limit")
Signed-off-by: Matt Redfearn 
---

What we really need to detect is that exec'ing modprobe failed, but
currently it does not get as far as an actual error since it just ends
up stuck waiting for the modprobes to complete, which they never will.
Open to suggestions of a different / better way to fix this.

Thanks,
Matt

---
 fs/exec.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 62175cbcc801..004bb50a01fe 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1644,6 +1644,9 @@ int search_binary_handler(struct linux_binprm *bprm)
if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
printable(bprm->buf[2]) && printable(bprm->buf[3]))
return retval;
+   /* Game over if we need to load a module to execute modprobe */
+   if (strcmp(bprm->filename, modprobe_path) == 0)
+   return retval;
	if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
return retval;
need_retry = false;
-- 
2.7.4



[GIT PULL] tracing: Minor updates for v4.13-rc1

2017-07-21 Thread Steven Rostedt

Linus,

Three minor updates

 - Use of the new GFP_RETRY_MAYFAIL to be more aggressive in allocating
   memory for the ring buffer without causing OOMs

 - Fix a memory leak in adding and removing instances

 - Add __rcu annotation to be able to debug RCU usage of function
   tracing a bit better.


Please pull the latest trace-v4.13-rc1 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-v4.13-rc1

Tag SHA1: d0170e4f4f90fcc494c23d5d7e1ebb67533ffa3c
Head SHA1: f86f418059b94aa01f9342611a272ca60c583e89


Chunyan Zhang (1):
  trace: fix the errors caused by incompatible type of RCU variables

Chunyu Hu (1):
  tracing: Fix kmemleak in instance_rmdir

Joel Fernandes (1):
  tracing/ring_buffer: Try harder to allocate


 include/linux/ftrace.h   |  6 +++---
 include/linux/trace_events.h |  2 +-
 kernel/trace/ftrace.c| 41 +++--
 kernel/trace/ring_buffer.c   | 10 +-
 kernel/trace/trace.c |  1 +
 kernel/trace/trace.h |  6 +++---
 6 files changed, 40 insertions(+), 26 deletions(-)
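
For context on the __rcu annotations in the diff below: they let sparse
and lockdep flag plain loads of RCU-protected pointers, which is what
the rcu_dereference_protected() conversions enforce. A minimal sketch of
the pattern, illustrative only and not taken from these patches; the
list_lock/head names are made up:

#include <linux/lockdep.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>

struct item {
	struct item __rcu *next;
};

static DEFINE_MUTEX(list_lock);
static struct item __rcu *head;

/* Update-side access: the mutex serializes writers, so no RCU read
 * lock is needed, but the annotated pointer must still be loaded via
 * rcu_dereference_protected() with the lockdep condition spelled out. */
static struct item *first_item_locked(void)
{
	return rcu_dereference_protected(head, lockdep_is_held(&list_lock));
}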
---
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 5857390ac35a..6383115e9d2c 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -145,8 +145,8 @@ enum {
 #ifdef CONFIG_DYNAMIC_FTRACE
 /* The hash used to know what functions callbacks trace */
 struct ftrace_ops_hash {
-   struct ftrace_hash  *notrace_hash;
-   struct ftrace_hash  *filter_hash;
+   struct ftrace_hash __rcu*notrace_hash;
+   struct ftrace_hash __rcu*filter_hash;
struct mutexregex_lock;
 };
 
@@ -168,7 +168,7 @@ static inline void ftrace_free_init_mem(void) { }
  */
 struct ftrace_ops {
ftrace_func_t   func;
-   struct ftrace_ops   *next;
+   struct ftrace_ops __rcu *next;
unsigned long   flags;
void*private;
ftrace_func_t   saved_func;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index f73cedfa2e0b..536c80ff7ad9 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -338,7 +338,7 @@ enum {
 struct trace_event_file {
struct list_headlist;
struct trace_event_call *event_call;
-   struct event_filter *filter;
+   struct event_filter __rcu   *filter;
struct dentry   *dir;
struct trace_array  *tr;
struct trace_subsystem_dir  *system;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 53f6b6401cf0..02004ae91860 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -113,7 +113,7 @@ static int ftrace_disabled __read_mostly;
 
 static DEFINE_MUTEX(ftrace_lock);
 
-static struct ftrace_ops *ftrace_ops_list __read_mostly = &ftrace_list_end;
+static struct ftrace_ops __rcu *ftrace_ops_list __read_mostly = &ftrace_list_end;
 ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
 static struct ftrace_ops global_ops;
 
@@ -169,8 +169,11 @@ int ftrace_nr_registered_ops(void)
 
 	mutex_lock(&ftrace_lock);
 
-   for (ops = ftrace_ops_list;
-        ops != &ftrace_list_end; ops = ops->next)
+   for (ops = rcu_dereference_protected(ftrace_ops_list,
+                                        lockdep_is_held(&ftrace_lock));
+        ops != &ftrace_list_end;
+        ops = rcu_dereference_protected(ops->next,
+                                        lockdep_is_held(&ftrace_lock)))
cnt++;
 
 	mutex_unlock(&ftrace_lock);
@@ -275,10 +278,11 @@ static void update_ftrace_function(void)
 * If there's only one ftrace_ops registered, the ftrace_ops_list
 * will point to the ops we want.
 */
-   set_function_trace_op = ftrace_ops_list;
+   set_function_trace_op = rcu_dereference_protected(ftrace_ops_list,
+                       lockdep_is_held(&ftrace_lock));
 
/* If there's no ftrace_ops registered, just call the stub function */
-   if (ftrace_ops_list == &ftrace_list_end) {
+   if (set_function_trace_op == &ftrace_list_end) {
func = ftrace_stub;
 
/*
@@ -286,7 +290,8 @@ static void update_ftrace_function(void)
 * recursion safe and not dynamic and the arch supports passing ops,
 * then have the mcount trampoline call the function directly.
 */
-   } else if (ftrace_ops_list->next == &ftrace_list_end) {
+   } else if (rcu_dereference_protected(ftrace_ops_list->next,
+                  lockdep_is_held(&ftrace_lock)) == &ftrace_list_end) {
func = ftrace_ops_get_list_func(ftrace_ops_list);
 
} else {
@@ -348,9 +353,11 @@ int using_ftrace_ops_list_func(void)
return ftrace_trace_function == ftrace_ops_list_func;
 }
 
-static void add_ftrace_ops(struct ftrace_ops 

Re: [PATCH v2 1/2] livepatch: introduce shadow variable API

2017-07-21 Thread Joe Lawrence
On 07/21/2017 05:13 AM, Petr Mladek wrote:
> On Thu 2017-07-20 16:30:37, Joe Lawrence wrote:
>> Going back to existing kpatch use-cases, since we paired shadow variable
>> creation to their parent object creation, -EEXIST was never an issue.  I
>> think we concocted one proof-of-concept kpatch where we created shadow
>> variables "in-flight", that is, we patched a routine that operated on
>> the parent object and created a shadow variable if one did not already
>> exist.  The in-flight patch was for a single function and we knew that it
>> would never be called concurrently for the same parent object.  tl;dr =
>> kpatch never worried about existing shadow variables.
> 
> I am not sure if you want to explain why you did not care. Or if
> you want to suggest that we should not care :-)

We knew that in our limited use-cases for in-flight shadow variables,
concurrency was not an issue.  Josh has a better historical perspective,
but I think this particular use-case appeared way after the initial
kpatch implementation of shadow variables.  Now that we know we can use
them in this way, I agree that it's important to hash out the
implications while designing the livepatch counterpart.

> I agree that if the API is used in simple/clear situations then
> this might look like an overkill. But I am afraid that the API users
> do not have this in hands. They usually have to create a livepatch
> based on an upstream security fix. The fix need not always be simple.
> Then it is handy to have an API that helps to catch mistakes
> and keeps the patched system in a sane state.
Very true, though I wonder what interesting state the system will be in
when running patched code with partially shadowed variables. :)  At the very
least, I think this protection would be valuable during patch sanity
testing.

>>> I would do WARN() in klp_shadow_attach() when the variable
>>> already existed and return NULL. Of course it might be innocent
>>> duplication. But it might mean that someone else is using another
>>> variable of the same name but with different content. klp_shadow_get()
>>> would then return the same variable for two different purposes.
>>> Then the whole system might end like a glass on a stony floor.
>>
>> What do you think of expanding the API to include each the cases
>> outlined above?   Something like:
>>
>>   1 - klp_attach = allocate and add a unique shadow variable to the hash,
>>       duplicates return NULL and a WARN
> 
> Sounds good.
> 
>>   2 - klp_get_or_attach = return the existing variable if it already exists,
>>   otherwise allocate a new one
> 
> Sounds good.
> 
>>   3 - klp_get_or_update = update and return the existing variable if it already
>>   exists, otherwise allocate a new one
> 
> I am not sure where this behavior would make sense. See below.
> 
> 
>> IMHO, I think cases 1 and 3 are most intuitive, so maybe case 2 should
>> be dropped.  Since you suggested adding klp_get_or_attach(), what do you
>> think?
> 
> I do not agree. Let's look at the example with the missing lock.
> The patch adds the lock if it did not exist. Then the lock can
> be used to synchronize all further operations.
> 
> klp_get_or_update() would always replace the existing lock
> with a freshly initialized one. We would lose the information
> if it was locked or not.

Ah good point, perhaps we have two situations here:

  A - A shadow variable that's pointing to some object, like a lock,
  where the original object is required.  (Your example above.)

  B - A shadow variable that's storing the data itself.  In other words,
  instead of attaching a pointer, the whole object was attached:

void patched_function()
{
   ...
   klp_get_or_attach(obj, id, &jiffies, sizeof(jiffies), ...)
   ...

  in which case the caller is only interested in pushing in the
  latest version of jiffies.

For these I suggest klp_get_or_attach() for case A and
klp_get_or_update() for case B.
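
A toy, self-contained sketch of the semantic difference between the two
calls; the livepatch API is still under discussion, so the single-slot
store and the function names here are purely illustrative:

#include <stdbool.h>
#include <string.h>

struct shadow { bool used; char data[64]; };

/* Case A: an existing shadow variable wins; a lock attached earlier
 * keeps its state instead of being re-initialized. */
static void *get_or_attach(struct shadow *s, const void *data, size_t n)
{
	if (!s->used) {
		memcpy(s->data, data, n);
		s->used = true;
	}
	return s->data;
}

/* Case B: the caller's latest value wins; suitable when the shadow
 * variable stores the data itself (e.g. a jiffies snapshot). */
static void *get_or_update(struct shadow *s, const void *data, size_t n)
{
	memcpy(s->data, data, n);
	s->used = true;
	return s->data;
}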

-- Joe


Re: [PATCH v7 06/16] lockdep: Detect and handle hist_lock ring buffer overwrite

2017-07-21 Thread Peter Zijlstra
On Fri, Jul 14, 2017 at 03:42:10PM +0900, Byungchul Park wrote:
> On Thu, Jul 13, 2017 at 08:23:33PM +0900, Byungchul Park wrote:
> > On Thu, Jul 13, 2017 at 8:12 PM, Peter Zijlstra  
> > wrote:
> > > On Thu, Jul 13, 2017 at 12:29:05PM +0200, Peter Zijlstra wrote:
> > >> On Thu, Jul 13, 2017 at 07:09:53PM +0900, Byungchul Park wrote:
> > >> > On Thu, Jul 13, 2017 at 11:50:52AM +0200, Peter Zijlstra wrote:
> > >> > >   wait_for_completion(&C);
> > >> > > atomic_inc_return();
> > >> > >
> > >> > >   mutex_lock(A1);
> > >> > >   mutex_unlock(A1);
> > >> > >
> > >> > >
> > >> > >   <IRQ>
> > >> > > spin_lock(B1);
> > >> > > spin_unlock(B1);
> > >> > >
> > >> > > ...
> > >> > >
> > >> > > spin_lock(B64);
> > >> > > spin_unlock(B64);
> > >> > >   </IRQ>
> > >> > >
> > >> > >
> > >
> > > Also consider the alternative:
> > >
> > > <IRQ>
> > >   spin_lock(D);
> > >   spin_unlock(D);
> > >
> > >   complete(&C);
> > > </IRQ>
> > >
> > > in which case the context test will also not work.
> > 
> > Context tests are done on xhlock with the release context, _not_
> > acquisition context. For example, spin_lock(D) and complete() are
> > in the same context, so the test would pass in this example.

The point was, this example will also link C to B*.

(/me copy paste from older email)

That gives:

xhist[ 0] = A1
xhist[ 1] = B1
...
xhist[63] = B63

then we wrap and have:

xhist[0] = B64

then we rewind to 1 and invalidate to arrive at:

xhist[ 0] = B64
xhist[ 1] = NULL   <-- idx
xhist[ 2] = B2
...
xhist[63] = B63


Then we do D and get

xhist[ 0] = B64
xhist[ 1] = D   <-- idx
xhist[ 2] = B2
...
xhist[63] = B63


And now there is nothing that will invalidate B*, after all, the
gen_id's are all after C's stamp, and the same_context_xhlock() test
will also pass because they're all from IRQ context (albeit not the
same, but it cannot tell).
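
A toy sketch of the wrap-and-invalidate sequence above: not the lockdep
code, just a 64-slot ring where the 65th entry overwrites slot 0 and the
rewind invalidates exactly one slot while the stale B2..B63 entries
survive:

#include <stdio.h>

#define XHLOCKS 64

static int xhist[XHLOCKS];	/* 0 marks an invalidated slot */
static unsigned int idx;

static void record(int lock_id)
{
	xhist[idx++ % XHLOCKS] = lock_id;
}

int main(void)
{
	record(1);			/* A1 lands in slot 0 */
	for (int i = 1; i <= 64; i++)	/* B1..B64; B64 wraps onto slot 0 */
		record(100 + i);
	xhist[idx % XHLOCKS] = 0;	/* rewind to slot 1 and invalidate */

	/* Slots 2..63 still hold stale B2..B63, exactly as in the trace. */
	printf("slot0=%d slot1=%d slot2=%d\n", xhist[0], xhist[1], xhist[2]);
	return 0;
}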


Does this explain? Or am I still missing something?


Re: linux-next: manual merge of the btrfs-kdave tree with Linus' tree

2017-07-21 Thread David Sterba
On Tue, Jul 18, 2017 at 10:18:02AM +1000, Stephen Rothwell wrote:
> Hi David,
> 
> Today's linux-next merge of the btrfs-kdave tree got a conflict in:
> 
>   fs/btrfs/extent_io.c
> 
> between commit:
> 
>   e6959b9350c6 ("btrfs: add support for passing in write hints for buffered 
> writes")
> 
> from Linus' tree and commit:
> 
>   41a3f2a7c062 ("btrfs: merge REQ_OP and REQ_ flags to one parameter in 
> submit_extent_page")
> 
> from the btrfs-kdave tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> You should probably consider rebasing your for-next branch onto (at
> least v4.13-rc1) (or merging v4.13-rc1) to save these sort of
> (unnecessary) conflicts being ongoing during development and the next
> merge window.

tl;dr I'm going to rebase for-next to 4.12 again

I've started rebasing on top of rc1 but found that several tests double
the run time. I've observed something similar already during the merge
window when testing master + pull request branch. The submit bio calls
were particularly visible in the stacks so I'm suspecting some block
layer related change or my system is misconfigured.

In order to be able to debug the problems further, I need a branch that
will reproduce the good results. For that reason I'll rebase my
for-next branch back onto 4.12-rc7. The following merge conflict will
reappear.

> diff --cc fs/btrfs/extent_io.c
> index 0aff9b278c19,ead9e731e01b..
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@@ -2803,8 -2801,7 +2805,8 @@@ static int submit_extent_page(unsigned 
>   bio_add_page(bio, page, page_size, offset);
>   bio->bi_end_io = end_io_func;
>   bio->bi_private = tree;
>  +bio->bi_write_hint = page->mapping->host->i_write_hint;
> - bio_set_op_attrs(bio, op, op_flags);
> + bio->bi_opf = opf;
>   if (wbc) {
>   wbc_init_bio(wbc, bio);
>   wbc_account_io(wbc, page, page_size);


Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac

2017-07-21 Thread Borislav Petkov
On Fri, Jul 21, 2017 at 10:40:01AM -0300, Mauro Carvalho Chehab wrote:
> What happens when the error can be corrected? Does it still report it to
> userspace, or just silently hide the error?
> 
> If I remember well about a past discussion with some vendor, I was told
> that the firmware can hide some errors from being reported. Is it
> still the case?

I've heard the same thing but I have no idea what they're actually
doing. But it would make sense because the intention is not to worry
users unnecessarily if it can hide the error and if there are no adverse
consequences from it.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--


[PATCH v2 1/4] fs/dcache: Limit numbers of negative dentries

2017-07-21 Thread Waiman Long
The number of positive dentries is limited by the number of files
in the filesystems. The number of negative dentries, however,
has no limit other than the total amount of memory available in
the system. So a rogue application that generates a lot of negative
dentries can potentially exhaust most of the memory available in the
system, impacting the performance of other running applications.

To prevent this from happening, the dcache code is now updated to limit
the number of negative dentries in the LRU lists that can be kept
as a percentage of total available system memory. The default is 5%
and can be changed by specifying the "neg_dentry_pc=" kernel command
line option.

If the negative dentry limit is exceeded, newly created negative
dentries will be killed right after use to avoid adding unpredictable
latency to the directory lookup operation.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c | 245 
 include/linux/dcache.h  |   1 +
 3 files changed, 221 insertions(+), 32 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66..7f5497b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2383,6 +2383,13 @@
 
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
 
+   neg_dentry_pc=  [KNL]
+   Range: 1-50
+   Default: 5
+   This parameter specifies the amount of negative
+   dentries allowed in the system as a percentage of
+   total system memory.
+
netdev= [NET] Network devices parameters
Format: 
Note that mem_start is often overloaded to mean
diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..866df38 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
.age_limit = 45,
 };
 
+/*
+ * Macros and variables to manage and count negative dentries.
+ */
+#define NEG_DENTRY_BATCH   (1 << 8)
+static long neg_dentry_percpu_limit __read_mostly;
+static struct {
+   raw_spinlock_t nfree_lock;
+   long nfree; /* Negative dentry free pool */
+} ndblk cacheline_aligned_in_smp;
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
+static DEFINE_PER_CPU(long, nr_dentry_neg);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
@@ -227,6 +238,86 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 
 #endif
 
+/*
+ * There is a system-wide limit to the amount of negative dentries allowed
+ * in the super blocks' LRU lists. The default limit is 5% of the total
+ * system memory. This limit can be changed by using the kernel command line
+ * option "neg_dentry_pc=" to specify the percentage of the total memory
+ * that can be used for negative dentries. That percentage must be in the
+ * 1-50% range.
+ *
+ * To avoid performance problems with a global counter on an SMP system,
+ * the tracking is done mostly on a per-cpu basis. The total limit is
+ * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
+ *
+ * If a per-cpu counter runs out of negative dentries, it can borrow extra
+ * ones from the global free pool. If it has more than its percpu limit,
+ * the extra ones will be returned back to the global pool.
+ */
+
+/*
+ * Decrement negative dentry count if applicable.
+ */
+static void __neg_dentry_dec(struct dentry *dentry)
+{
+   if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
+   long *pcnt = get_cpu_ptr(&nr_dentry_neg);
+
+   if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
+   ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
+   *pcnt += NEG_DENTRY_BATCH;
+   raw_spin_unlock(&ndblk.nfree_lock);
+   }
+   put_cpu_ptr(&nr_dentry_neg);
+   }
+}
+
+static inline void neg_dentry_dec(struct dentry *dentry)
+{
+   if (unlikely(d_is_negative(dentry)))
+   __neg_dentry_dec(dentry);
+}
+
+/*
+ * Increment negative dentry count if applicable.
+ */
+static void __neg_dentry_inc(struct dentry *dentry)
+{
+   long cnt, *pcnt;
+
+   if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
+   return;
+
+   pcnt = get_cpu_ptr(&nr_dentry_neg);
+   cnt  = (READ_ONCE(ndblk.nfree) &&
+  (*pcnt > neg_dentry_percpu_limit)) ? NEG_DENTRY_BATCH : 0;
+
+   if (cnt && raw_spin_trylock(&ndblk.nfree_lock)) {
+   long val = READ_ONCE(ndblk.nfree);
+
+   if (val < cnt)
+   cnt = val;
+   ACCESS_ONCE(ndblk.nfree) -= cnt;
+   *pcnt -= cnt;
+  
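
A toy, single-file sketch of the 80/20 scheme described in the comment
block above; the real patch uses this_cpu operations and a raw spinlock,
and the names and fixed CPU count here are illustrative only:

#include <stdbool.h>

#define NCPUS 4
#define BATCH 256

static long global_pool;	/* the 20% share, set at init */
static long percpu_left[NCPUS];	/* the 80% share, split per CPU */

/* Returns true if this CPU may keep one more negative dentry. */
static bool neg_dentry_may_inc(int cpu)
{
	if (percpu_left[cpu] > 0) {
		percpu_left[cpu]--;
		return true;
	}
	if (global_pool >= BATCH) {	/* borrow a batch from the pool */
		global_pool -= BATCH;
		percpu_left[cpu] += BATCH - 1;
		return true;
	}
	return false;			/* over the limit: kill after use */
}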

[PATCH v2 0/4] fs/dcache: Limit # of negative dentries

2017-07-21 Thread Waiman Long
 v1->v2:
  - Move the new nr_negative field to the end of dentry_stat_t structure
as suggested by Matthew Wilcox.
  - With the help of Miklos Szeredi, fix incorrect locking order in
dentry_kill() by using lock_parent() instead of locking the parent's
d_lock directly.
  - Correctly account for positive to negative dentry transitions.
  - Automatic pruning of negative dentries will now ignore the reference
    bit in negative dentries; regular shrinking still honors it.

A rogue application can potentially create a large number of negative
dentries in the system, consuming most of the available memory. This
can impact the performance of other applications running on the system.

This patchset introduces changes to the dcache subsystem to limit
the number of negative dentries allowed to be created thus limiting
the amount of memory that can be consumed by negative dentries.

Patch 1 tracks the number of negative dentries used and disallow
the creation of more when the limit is reached.

Patch 2 enables /proc/sys/fs/dentry-state to report the number of
negative dentries in the system.

Patch 3 enables automatic pruning of negative dentries when it is
close to the limit so that we won't end up killing recently used
negative dentries.

Patch 4 prevents racing between negative dentry pruning and umount
operation.

Waiman Long (4):
  fs/dcache: Limit numbers of negative dentries
  fs/dcache: Report negative dentry number in dentry-state
  fs/dcache: Enable automatic pruning of negative dentries
  fs/dcache: Protect negative dentry pruning from racing with umount

 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c | 408 ++--
 include/linux/dcache.h  |   8 +-
 include/linux/list_lru.h|   1 +
 mm/list_lru.c   |   4 +-
 5 files changed, 392 insertions(+), 36 deletions(-)

-- 
1.8.3.1



[PATCH v2 3/4] fs/dcache: Enable automatic pruning of negative dentries

2017-07-21 Thread Waiman Long
Having a limit for the number of negative dentries does have an
undesirable side effect that no new negative dentries will be allowed
when the limit is reached. This will have performance implications
for some types of workloads.

So we need a way to prune the negative dentries so that new ones can
be created. This is done by using the workqueue API to do the pruning
gradually once a threshold is reached, to minimize the performance impact
on other running tasks. The pruning is done at a frequency of 10 runs
per second. Each run scans at most 256 LRU dentries for each node LRU
list of a certain superblock. Some non-negative dentries that happen
to be at the front of the LRU lists will also be pruned.

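The gradual pruning described above is the kernel's usual self-rearming
delayed-work pattern; a minimal sketch (illustrative, not the patch
code; prune_one_batch() is a made-up stand-in for the bounded LRU walk):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

#define PRUNE_DELAY	(HZ / 10)	/* ~10 runs per second */

/* Stand-in for one bounded LRU scan; returns true if more remains. */
static bool prune_one_batch(void)
{
	return false;
}

static void prune_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(prune_work, prune_fn);

static void prune_fn(struct work_struct *work)
{
	/* Do a bounded amount of work, then yield and re-arm. */
	if (prune_one_batch())
		schedule_delayed_work(&prune_work, PRUNE_DELAY);
}
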
Signed-off-by: Waiman Long 
---
 fs/dcache.c  | 113 +++
 include/linux/list_lru.h |   1 +
 mm/list_lru.c|   4 +-
 3 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d6d2d2d..c2ea876 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -134,13 +134,19 @@ struct dentry_stat_t dentry_stat = {
  * Macros and variables to manage and count negative dentries.
  */
 #define NEG_DENTRY_BATCH   (1 << 8)
+#define NEG_PRUNING_DELAY  (HZ/10)
 static long neg_dentry_percpu_limit __read_mostly;
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
long nfree; /* Negative dentry free pool */
+   struct super_block *prune_sb;   /* Super_block for pruning */
+   int neg_count, prune_count; /* Pruning counts */
 } ndblk cacheline_aligned_in_smp;
 
+static void prune_negative_dentry(struct work_struct *work);
+static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
 static DEFINE_PER_CPU(long, nr_dentry_neg);
@@ -323,6 +329,16 @@ static void __neg_dentry_inc(struct dentry *dentry)
 */
if (!cnt)
dentry->d_flags |= DCACHE_KILL_NEGATIVE;
+
+   /*
+    * Initiate negative dentry pruning if free pool has less than
+    * 1/4 of its initial value.
+    */
+   if (READ_ONCE(ndblk.nfree) < neg_dentry_nfree_init/4) {
+   WRITE_ONCE(ndblk.prune_sb, dentry->d_sb);
+   schedule_delayed_work(&prune_neg_dentry_work,
+ NEG_PRUNING_DELAY);
+   }
 }
 
 static inline void neg_dentry_inc(struct dentry *dentry)
@@ -1320,6 +1336,103 @@ void shrink_dcache_sb(struct super_block *sb)
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
+/*
+ * A modified version that attempts to remove a limited number of negative
+ * dentries as well as some other non-negative dentries at the front.
+ */
+static enum lru_status dentry_negative_lru_isolate(struct list_head *item,
+   struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
+{
+   struct list_head *freeable = arg;
+   struct dentry   *dentry = container_of(item, struct dentry, d_lru);
+   enum lru_status status = LRU_SKIP;
+
+   /*
+    * Stop further list walking for the current node list to limit
+    * performance impact, but allow further walking in the next node
+    * list.
+    */
+   if ((ndblk.neg_count >= NEG_DENTRY_BATCH) ||
+   (ndblk.prune_count >= NEG_DENTRY_BATCH)) {
+   ndblk.prune_count = 0;
+   return LRU_STOP;
+   }
+
+   /*
+    * we are inverting the lru lock/dentry->d_lock here,
+    * so use a trylock. If we fail to get the lock, just skip
+    * it
+    */
+   if (!spin_trylock(&dentry->d_lock)) {
+   ndblk.prune_count++;
+   return LRU_SKIP;
+   }
+
+   /*
+    * Referenced dentries are still in use. If they have active
+    * counts, just remove them from the LRU. Otherwise give them
+    * another pass through the LRU.
+    */
+   if (dentry->d_lockref.count) {
+   d_lru_isolate(lru, dentry);
+   status = LRU_REMOVED;
+   goto out;
+   }
+
+   /*
+    * Dentries with reference bit on are moved back to the tail
+    * except for the negative ones.
+    */
+   if ((dentry->d_flags & DCACHE_REFERENCED) && !d_is_negative(dentry)) {
+   dentry->d_flags &= ~DCACHE_REFERENCED;
+   status = LRU_ROTATE;
+   goto out;
+   }
+
+   status = LRU_REMOVED;
+   d_lru_shrink_move(lru, dentry, freeable);
+   if (d_is_negative(dentry))
+   ndblk.neg_count++;
+out:
+   spin_unlock(&dentry->d_lock);
+   ndblk.prune_count++;
+   return status;
+}
+
+/*
+ * A workqueue function to prune negative dentry.
+ *
+ * The pruning is done gradually over time so as not to have noticeable
+ * performance impact.
+ */
+static void prune_negative_dentry(struct work_struct *work)
+{
+

[PATCH v2 4/4] fs/dcache: Protect negative dentry pruning from racing with umount

2017-07-21 Thread Waiman Long
The negative dentry pruning is done on a specific super_block set
in the ndblk.prune_sb variable. If the super_block is also being
unmounted concurrently, its content may no longer be valid.

To protect against such a race condition, a new lock is added to
the ndblk structure to synchronize the negative dentry pruning and
umount operation. This is a regular spinlock as the pruning operation
can be quite time consuming.

Signed-off-by: Waiman Long 
---
 fs/dcache.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c2ea876..a3159f3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -139,11 +139,13 @@ struct dentry_stat_t dentry_stat = {
 static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
+   spinlock_t prune_lock;  /* Lock for protecting pruning */
long nfree; /* Negative dentry free pool */
struct super_block *prune_sb;   /* Super_block for pruning */
int neg_count, prune_count; /* Pruning counts */
 } ndblk ____cacheline_aligned_in_smp;
 
+static void clear_prune_sb_for_umount(struct super_block *sb);
 static void prune_negative_dentry(struct work_struct *work);
 static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentry);
 
@@ -1323,6 +1325,7 @@ void shrink_dcache_sb(struct super_block *sb)
 {
long freed;
 
+   clear_prune_sb_for_umount(sb);
do {
LIST_HEAD(dispose);
 
@@ -1353,7 +1356,8 @@ static enum lru_status dentry_negative_lru_isolate(struct list_head *item,
 * list.
 */
if ((ndblk.neg_count >= NEG_DENTRY_BATCH) ||
-   (ndblk.prune_count >= NEG_DENTRY_BATCH)) {
+   (ndblk.prune_count >= NEG_DENTRY_BATCH) ||
+   !READ_ONCE(ndblk.prune_sb)) {
ndblk.prune_count = 0;
return LRU_STOP;
}
@@ -1408,15 +1412,24 @@ static enum lru_status dentry_negative_lru_isolate(struct list_head *item,
 static void prune_negative_dentry(struct work_struct *work)
 {
int freed;
-   struct super_block *sb = READ_ONCE(ndblk.prune_sb);
+   struct super_block *sb;
LIST_HEAD(dispose);
 
-   if (!sb)
+   /*
+* The prune_lock is used to protect negative dentry pruning from
+* racing with concurrent umount operation.
+*/
+   spin_lock(&ndblk.prune_lock);
+   sb = READ_ONCE(ndblk.prune_sb);
+   if (!sb) {
+   spin_unlock(&ndblk.prune_lock);
return;
+   }
 
ndblk.neg_count = ndblk.prune_count = 0;
freed = list_lru_walk(&sb->s_dentry_lru, dentry_negative_lru_isolate,
  &dispose, NEG_DENTRY_BATCH);
+   spin_unlock(&ndblk.prune_lock);
 
if (freed)
shrink_dentry_list(&dispose);
@@ -1433,6 +1446,27 @@ static void prune_negative_dentry(struct work_struct *work)
WRITE_ONCE(ndblk.prune_sb, NULL);
 }
 
+/*
+ * This is called before an umount to clear ndblk.prune_sb if it
+ * matches the given super_block.
+ */
+static void clear_prune_sb_for_umount(struct super_block *sb)
+{
+   if (likely(READ_ONCE(ndblk.prune_sb) != sb))
+   return;
+   WRITE_ONCE(ndblk.prune_sb, NULL);
+   /*
+* Need to wait until an ongoing pruning operation, if present,
+* is completed.
+*
+* Clearing ndblk.prune_sb will hasten the completion of pruning.
+* In the unlikely event that ndblk.prune_sb is set to another
+* super_block, the wait will last for the complete pruning operation,
+* which shouldn't be that long either.
+*/
+   spin_unlock_wait(&ndblk.prune_lock);
+}
+
 /**
  * enum d_walk_ret - action to take during tree walk
  * @D_WALK_CONTINUE:   continue walk
@@ -1755,6 +1789,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
 
WARN(down_read_trylock(&sb->s_umount), "s_umount should've been locked");
 
+   clear_prune_sb_for_umount(sb);
dentry = sb->s_root;
sb->s_root = NULL;
do_one_tree(dentry);
@@ -3857,6 +3892,7 @@ static void __init neg_dentry_init(void)
unsigned long cnt;
 
raw_spin_lock_init(&ndblk.nfree_lock);
+   spin_lock_init(&ndblk.prune_lock);
 
/* 20% in global pool & 80% in percpu free */
ndblk.nfree = neg_dentry_nfree_init
-- 
1.8.3.1
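
For context, the clear-then-wait idiom in this patch generalizes beyond the
dcache. Below is a minimal standalone userspace sketch of the same pattern;
the pthread calls and the prune_target variable are illustrative stand-ins,
not part of the patch. The teardown side clears the shared pointer first and
then waits out any worker still inside the locked region, much like
clear_prune_sb_for_umount() does with spin_unlock_wait().

    #include <pthread.h>
    #include <stdio.h>

    static pthread_spinlock_t prune_lock;
    static void *prune_target;      /* stands in for ndblk.prune_sb */

    static void worker_step(void)   /* prune side: lock, re-check, work */
    {
            pthread_spin_lock(&prune_lock);
            void *t = __atomic_load_n(&prune_target, __ATOMIC_RELAXED);
            if (t) {
                    /* ... prune using t; t stays valid while the lock is held ... */
            }
            pthread_spin_unlock(&prune_lock);
    }

    static void teardown(void *obj) /* umount side: clear, then drain */
    {
            if (__atomic_load_n(&prune_target, __ATOMIC_RELAXED) != obj)
                    return;
            __atomic_store_n(&prune_target, NULL, __ATOMIC_RELAXED);
            /* a stronger, simpler stand-in for spin_unlock_wait() */
            pthread_spin_lock(&prune_lock);
            pthread_spin_unlock(&prune_lock);
    }

    int main(void)
    {
            int obj;

            pthread_spin_init(&prune_lock, PTHREAD_PROCESS_PRIVATE);
            __atomic_store_n(&prune_target, (void *)&obj, __ATOMIC_RELAXED);
            worker_step();
            teardown(&obj);
            printf("cleared: %d\n", prune_target == NULL);
            return 0;
    }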



[PATCH v2 0/4] fs/dcache: Limit # of negative dentries

2017-07-21 Thread Waiman Long
 v1->v2:
  - Move the new nr_negative field to the end of dentry_stat_t structure
as suggested by Matthew Wilcox.
  - With the help of Miklos Szeredi, fix incorrect locking order in
dentry_kill() by using lock_parent() instead of locking the parent's
d_lock directly.
  - Correctly account for positive to negative dentry transitions.
  - Automatic pruning of negative dentries will now ignore the reference
bit in negative dentries but not the regular shrinking.

A rogue application can potentially create a large number of negative
dentries in the system consuming most of the memory available. This
can impact performance of other applications running on the system.

This patchset introduces changes to the dcache subsystem to limit
the number of negative dentries allowed to be created thus limiting
the amount of memory that can be consumed by negative dentries.

Patch 1 tracks the number of negative dentries used and disallows
the creation of more when the limit is reached.

Patch 2 enables /proc/sys/fs/dentry-state to report the number of
negative dentries in the system.

Patch 3 enables automatic pruning of negative dentries when the count
is close to the limit so that we won't end up killing recently used
negative dentries.

Patch 4 prevents racing between negative dentry pruning and umount
operation.

Waiman Long (4):
  fs/dcache: Limit numbers of negative dentries
  fs/dcache: Report negative dentry number in dentry-state
  fs/dcache: Enable automatic pruning of negative dentries
  fs/dcache: Protect negative dentry pruning from racing with umount

 Documentation/admin-guide/kernel-parameters.txt |   7 +
 fs/dcache.c | 408 ++--
 include/linux/dcache.h  |   8 +-
 include/linux/list_lru.h|   1 +
 mm/list_lru.c   |   4 +-
 5 files changed, 392 insertions(+), 36 deletions(-)

-- 
1.8.3.1




[PATCH v2 2/4] fs/dcache: Report negative dentry number in dentry-state

2017-07-21 Thread Waiman Long
The number of negative dentries currently in the system is now reported
in the /proc/sys/fs/dentry-state file.

Signed-off-by: Waiman Long 
---
 fs/dcache.c| 16 +++-
 include/linux/dcache.h |  7 ---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 866df38..d6d2d2d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -135,6 +135,7 @@ struct dentry_stat_t dentry_stat = {
  */
 #define NEG_DENTRY_BATCH   (1 << 8)
 static long neg_dentry_percpu_limit __read_mostly;
+static long neg_dentry_nfree_init __read_mostly; /* Free pool initial value */
 static struct {
raw_spinlock_t nfree_lock;
long nfree; /* Negative dentry free pool */
@@ -176,11 +177,23 @@ static long get_nr_dentry_unused(void)
return sum < 0 ? 0 : sum;
 }
 
+static long get_nr_dentry_neg(void)
+{
+   int i;
+   long sum = 0;
+
+   for_each_possible_cpu(i)
+   sum += per_cpu(nr_dentry_neg, i);
+   sum += neg_dentry_nfree_init - ndblk.nfree;
+   return sum < 0 ? 0 : sum;
+}
+
 int proc_nr_dentry(struct ctl_table *table, int write, void __user *buffer,
   size_t *lenp, loff_t *ppos)
 {
dentry_stat.nr_dentry = get_nr_dentry();
dentry_stat.nr_unused = get_nr_dentry_unused();
+   dentry_stat.nr_negative = get_nr_dentry_neg();
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 #endif
@@ -3733,7 +3746,8 @@ static void __init neg_dentry_init(void)
raw_spin_lock_init(&ndblk.nfree_lock);
 
/* 20% in global pool & 80% in percpu free */
-   ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
+   ndblk.nfree = neg_dentry_nfree_init
+   = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
cnt = ndblk.nfree * 4 / num_possible_cpus();
if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
cnt = 2 * NEG_DENTRY_BATCH;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 498233b..0f749e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -63,9 +63,10 @@ struct qstr {
 struct dentry_stat_t {
long nr_dentry;
long nr_unused;
-   long age_limit;  /* age in seconds */
-   long want_pages; /* pages requested by system */
-   long dummy[2];
+   long age_limit; /* age in seconds */
+   long want_pages;/* pages requested by system */
+   long nr_negative;   /* # of negative dentries */
+   long dummy;
 };
 extern struct dentry_stat_t dentry_stat;
 
-- 
1.8.3.1
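
With this change, nr_negative is the fifth of the six longs in
/proc/sys/fs/dentry-state (after nr_dentry, nr_unused, age_limit and
want_pages). A minimal standalone reader, assuming a kernel with this patch
applied:

    #include <stdio.h>

    int main(void)
    {
            long v[6] = { 0 };
            FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%ld %ld %ld %ld %ld %ld",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) == 6)
                    printf("negative dentries: %ld\n", v[4]);
            fclose(f);
            return 0;
    }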




[PATCH] staging: rtl8192u: fix incorrect mask and shift on u8 data

2017-07-21 Thread Colin King
From: Colin Ian King 

The cfg_action bit comes from the high bit of pmsg[4], so the
current mask and shift are incorrect and always result in
zero.  Fix this by using the correct mask and shift to get the
correct cfg_action bit value.

Detected by CoverityScan, CID#142890 ("Operands don't affect result")

Signed-off-by: Colin Ian King 
---
 drivers/staging/rtl8192u/r819xU_cmdpkt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rtl8192u/r819xU_cmdpkt.c b/drivers/staging/rtl8192u/r819xU_cmdpkt.c
index 87ab3ba760fc..ae9a4f1ac8fd 100644
--- a/drivers/staging/rtl8192u/r819xU_cmdpkt.c
+++ b/drivers/staging/rtl8192u/r819xU_cmdpkt.c
@@ -294,7 +294,7 @@ static void cmpk_handle_query_config_rx(struct net_device *dev, u8 *pmsg)
 * windows OS. So we have to read the content byte by byte or transfer
 * endian type before copy the message copy.
 */
-   rx_query_cfg.cfg_action = (pmsg[4] & 0x8000) >> 31;
+   rx_query_cfg.cfg_action = (pmsg[4] & 0x80) >> 7;
rx_query_cfg.cfg_type   = (pmsg[4] & 0x60) >> 5;
rx_query_cfg.cfg_size   = (pmsg[4] & 0x18) >> 3;
rx_query_cfg.cfg_page   = (pmsg[6] & 0x0F) >> 0;
-- 
2.11.0
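
The arithmetic is easy to check in isolation: pmsg[4] is a u8, so masking it
with 0x8000 can never be non-zero. A standalone sketch (the variable name is
illustrative):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint8_t msg4 = 0xff;                    /* high bit set */

            printf("%d\n", (msg4 & 0x8000) >> 31);  /* buggy form: always 0 */
            printf("%d\n", (msg4 & 0x80) >> 7);     /* fixed form: 1 */
            return 0;
    }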




Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac

2017-07-21 Thread Mauro Carvalho Chehab
Em Fri, 21 Jul 2017 15:34:41 +0200
Borislav Petkov  escreveu:

> On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> > GHES / firmware-first still requires OS recovery actions when an error
> > cannot be corrected by the platform.  They are handled by ghes_proc(),
> > and ghes_edac remains its error-reporting wrapper. 

What happens when the error can be corrected? Does it still report it to
userspace, or just silently hide the error?

If I remember a past discussion with some vendor correctly, I was told
that the firmware can hide some errors from being reported. Is that
still the case?


Thanks,
Mauro



Problem to compile Kernel 3.18.55 in fork.c:341

2017-07-21 Thread Sam Przyswa (Perso)

Hi all !

I tried to compile kernel 3.18.55 and got this error message during
'make deb-pkg':


kernel/built-in.o: In function `dup_task_struct':
/home/samp/kernel/linux-3.18.55/kernel/fork.c:341: undefined reference to `get_random_long'

Makefile:927: recipe for target 'vmlinux' failed
make[2]: *** [vmlinux] Error 1
scripts/package/Makefile:90: recipe for target 'deb-pkg' failed
make[1]: *** [deb-pkg] Error 2
Makefile:1233: recipe for target 'deb-pkg' failed
make: *** [deb-pkg] Error 2

What's wrong ?

Thanks for your help.

Sam.




[PATCH net 2/2] bpf/verifier: fix min/max handling in BPF_SUB

2017-07-21 Thread Edward Cree
We have to subtract the src max from the dst min, and vice-versa, since
 (e.g.) the smallest result comes from the largest subtrahend.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Edward Cree 
---
 kernel/bpf/verifier.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index af9e84a..664d939 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1865,10 +1865,12 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 * do our normal operations to the register, we need to set the values
 * to the min/max since they are undefined.
 */
-   if (min_val == BPF_REGISTER_MIN_RANGE)
-   dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
-   if (max_val == BPF_REGISTER_MAX_RANGE)
-   dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
+   if (opcode != BPF_SUB) {
+   if (min_val == BPF_REGISTER_MIN_RANGE)
+   dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
+   if (max_val == BPF_REGISTER_MAX_RANGE)
+   dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
+   }
 
switch (opcode) {
case BPF_ADD:
@@ -1879,10 +1881,17 @@ static void adjust_reg_min_max_vals(struct bpf_verifier_env *env,
dst_reg->min_align = min(src_align, dst_align);
break;
case BPF_SUB:
+   /* If one of our values was at the end of our ranges, then the
+* _opposite_ value in the dst_reg goes to the end of our range.
+*/
+   if (min_val == BPF_REGISTER_MIN_RANGE)
+   dst_reg->max_value = BPF_REGISTER_MAX_RANGE;
+   if (max_val == BPF_REGISTER_MAX_RANGE)
+   dst_reg->min_value = BPF_REGISTER_MIN_RANGE;
if (dst_reg->min_value != BPF_REGISTER_MIN_RANGE)
-   dst_reg->min_value -= min_val;
+   dst_reg->min_value -= max_val;
if (dst_reg->max_value != BPF_REGISTER_MAX_RANGE)
-   dst_reg->max_value -= max_val;
+   dst_reg->max_value -= min_val;
dst_reg->min_align = min(src_align, dst_align);
break;
case BPF_MUL:
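
The interval arithmetic the fix encodes, as a standalone C sketch (struct
range here is illustrative, not the verifier's bpf_reg_state): for
[a,b] - [c,d], the smallest result is a - d and the largest is b - c.

    #include <stdio.h>

    struct range { long min, max; };

    static struct range range_sub(struct range x, struct range y)
    {
            struct range r = {
                    .min = x.min - y.max, /* smallest minuend - largest subtrahend */
                    .max = x.max - y.min, /* largest minuend - smallest subtrahend */
            };
            return r;
    }

    int main(void)
    {
            struct range a = { 0, 255 }, b = { 0, 255 };
            struct range r = range_sub(a, b);

            /* prints [-255, 255]; the swapped (buggy) form would give [0, 0] */
            printf("[%ld, %ld]\n", r.min, r.max);
            return 0;
    }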



[PATCH net 1/2] selftests/bpf: subtraction bounds test

2017-07-21 Thread Edward Cree
There is a bug in the verifier's handling of BPF_SUB: [a,b] - [c,d] yielded
 [a-c, b-d] rather than the correct [a-d, b-c].  So here is a test
 which, with the bogus handling, will produce ranges of [0,0] and thus
 allowed accesses; whereas the correct handling will give a range of
 [-255, 255] (and hence the right-shift will give a range of [0, 255]) and
 the accesses will be rejected.

Signed-off-by: Edward Cree 
---
 tools/testing/selftests/bpf/test_verifier.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index af7d173..addea82 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -5980,6 +5980,34 @@ static struct bpf_test tests[] = {
.result = REJECT,
.result_unpriv = REJECT,
},
+   {
+   "subtraction bounds (map value)",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 9),
+   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+   BPF_JMP_IMM(BPF_JGT, BPF_REG_1, 0xff, 7),
+   BPF_LDX_MEM(BPF_B, BPF_REG_3, BPF_REG_0, 1),
+   BPF_JMP_IMM(BPF_JGT, BPF_REG_3, 0xff, 5),
+   BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_3),
+   BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 56),
+   BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map1 = { 3 },
+   .errstr_unpriv = "R0 pointer arithmetic prohibited",
+   .errstr = "R0 min value is negative, either use unsigned index or do a if (index >=0) check.",
+   .result = REJECT,
+   .result_unpriv = REJECT,
+   },
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
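
A quick standalone check of the ranges the commit message mentions: for a
64-bit value in [-255, 255], a logical right shift by 56 lands in [0, 255],
since non-negative inputs shift to 0 while negative inputs expose their
all-ones top byte.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            int64_t lo = -255, hi = 255;

            printf("%llu\n", (unsigned long long)(uint64_t)hi >> 56); /* 0 */
            printf("%llu\n", (unsigned long long)(uint64_t)lo >> 56); /* 255 */
            return 0;
    }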




[PATCH net 0/2] bpf: fix verifier min/max handling in BPF_SUB

2017-07-21 Thread Edward Cree
I managed to come up with a test for the swapped bounds in BPF_SUB, so here
 it is along with a patch that fixes it, separated out from my 'rewrite
 everything' series so it can go to -stable.

Edward Cree (2):
  selftests/bpf: subtraction bounds test
  bpf/verifier: fix min/max handling in BPF_SUB

 kernel/bpf/verifier.c   | 21 +++--
 tools/testing/selftests/bpf/test_verifier.c | 28 
 2 files changed, 43 insertions(+), 6 deletions(-)



Re: [RESEND PATCH] c6x: defconfig: Cleanup from old Kconfig options

2017-07-21 Thread Mark Salter
On Thu, 2017-07-20 at 06:58 +0200, Krzysztof Kozlowski wrote:
> Remove old, dead Kconfig options (in order appearing in this commit):
>  - EXPERIMENTAL is gone since v3.9;
>  - MISC_DEVICES: commit 7c5763b8453a ("drivers: misc: Remove
>MISC_DEVICES config option");
> 
> Signed-off-by: Krzysztof Kozlowski 

Thanks, I pulled this into the c6x tree.



Re: [PATCH v5 0/2] add UniPhier thermal support

2017-07-21 Thread Masami Hiramatsu
Hello,

2017-07-21 22:24 GMT+09:00 Masahiro Yamada :
> 2017-07-21 20:21 GMT+09:00 Kunihiko Hayashi :
>> This series adds support for CPU temperature monitor modules implemented
>> on UniPhier LD20 and PXs2 SoCs. This driver supports temperature monitoring
>> and alert function on the module.
>>
>> Changes in v4:
>> - fix warnings from sparse by replacing u32 with __be32
>
> Nit.
>
> Your subject prefix indicates this patch is v5.
>
> Do you mean "Changes in v5" or "Changes since v4" ?

I've checked all previous series and it seems to mean "Changes from vN".

Thanks,

>
>
>
>
>
>
>> Changes in v3:
>> - remove TMOD_MASK and use TMOD_WIDTH representing the bit width of TMOD
>>
>> Changes in v2:
>> - add nsleep after starting and stopping PVT
>> - replace temperature calculation with sign_extend32()
>>
>> Changes in v1:
>> - separate dts from this patchset as another patchset
>> - remove 'reg' description on the dt-bindings document
>> - fix the order of calling initialization functions
>> - replace mask bits to use GENMASK
>> - fix calculation of temperature because of not considering a negative value
>> - use devm_request_threaded_irq() instead of devm_request_irq() and
>>   separate a thread function from the interrupt handler
>> - add dependency to Kconfig
>> - set 120C to CRITICAL_TEMP_LIMIT as maximum temperature
>> - shrink each line of parameters to save the number of lines
>> - improve some comments and copyright description
>>
>> Kunihiko Hayashi (2):
>>   dt-bindings: thermal: add binding documentation for UniPhier thermal
>> monitor
>>   thermal: uniphier: add UniPhier thermal driver
>>
>>  .../bindings/thermal/uniphier-thermal.txt  |  64 
>>  drivers/thermal/Kconfig|   8 +
>>  drivers/thermal/Makefile   |   1 +
>>  drivers/thermal/uniphier_thermal.c | 386 
>> +
>>  4 files changed, 459 insertions(+)
>>  create mode 100644 
>> Documentation/devicetree/bindings/thermal/uniphier-thermal.txt
>>  create mode 100644 drivers/thermal/uniphier_thermal.c
>>
>> --
>> 2.7.4
>>
>
>
>
> --
> Best Regards
> Masahiro Yamada



-- 
Masami Hiramatsu



Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac

2017-07-21 Thread Borislav Petkov
On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> GHES / firmware-first still requires OS recovery actions when an error
> cannot be corrected by the platform.  They are handled by ghes_proc(),
> and ghes_edac remains its error-reporting wrapper.

I mean all the recovery actions the firmware does because it gets to see
the error first. Otherwise, Firmware First is the dumbest repeater
layer in the history of layers.

> Firmware has better knowledge about the platform and can provide better
> RAS when implemented properly.

s/when/if/

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--



Re: [PATCH] lib/int_sqrt.c: Optimize square root function

2017-07-21 Thread Peter Zijlstra
On Fri, Jul 21, 2017 at 03:26:21PM +0200, Peter Zijlstra wrote:
> 
> EVENT=0 -DNEW=1 -DFLS=1
> event: 19.626050 +- 0.038995
> EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
> event: 109.610670 +- 0.425667
> 
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
> event: 21.445680 +- 0.043782
> EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
> event: 83.590420 +- 0.142126
> 

> Let me dig out another GCC version current:
> 
>   gcc (Debian 6.3.0-18) 6.3.0 20170516

gcc-7 (Debian 7.1.0-9) 7.1.0

EVENT=0 -DNEW=1 -DFLS=1
event: 24.179400 +- 0.031344
EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 137.892390 +- 0.307314

EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 22.740300 +- 0.051317
EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 136.980640 +- 0.223410


GCC regressed it seems... *sigh*



Re: [PATCH 3/4] ACPI / APEI: Drop uninformative messages during boot

2017-07-21 Thread Borislav Petkov
On Thu, Jul 20, 2017 at 06:50:51PM +0100, Punit Agrawal wrote:
> "Firmware does not support APEI firmware first mode"
> 
> Thoughts?

I guess the simplest would be to add a third state to that hest_disable
to denote "HEST table not found" and then exit ghes_init() early, based on
checking it.

Otherwise ghes_init() inits a bunch of things which you probably don't
want on a platform which doesn't support APEI.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 



Re: [PATCH] lib/int_sqrt.c: Optimize square root function

2017-07-21 Thread Peter Zijlstra
On Fri, Jul 21, 2017 at 05:15:10AM -0700, Joe Perches wrote:
> On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> > if (x <= 1)
> > return x;
> >  
> > -   m = 1UL << (BITS_PER_LONG - 2);
> > +   m = 1UL << (__fls(x) & ~1U);
> > +
> > +   while (m > x)
> > +   m >>= 2;
> 
> while (m > x) ?
> 
> Belt and suspenders if __fls is broken?

Hmm... you're right, that should not happen. It is a remnant from when I
rounded up, like:

m = 1UL << ((__fls(x) + 1) & ~1UL);

Because I worried about the case where m == x, which is not included in
the loop above (but works when you look at the actual computation loop
and passes VALIDATE=1).


But check this... I cannot explain :/

When I remove that loop, we, as fully expected, lose 1 branch, but the
cycle count for the branch-cold case shoots up. Must be something GCC
does.


EVENT=0 -DNEW=1 -DFLS=1
event: 19.626050 +- 0.038995
EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 109.610670 +- 0.425667

EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.445680 +- 0.043782
EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 83.590420 +- 0.142126


EVENT=4 -DNEW=1 -DFLS=1
event: 20.252330 +- 0.005265
EVENT=4 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 20.252340 +- 0.005265

EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.252300 +- 0.005266
EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 21.252300 +- 0.005266


EVENT=5 -DNEW=1 -DFLS=1
event: 0.019370 +- 0.000732
EVENT=5 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 3.665240 +- 0.005309

EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 0.020150 +- 0.000755
EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 2.225330 +- 0.004875


Let me dig out another GCC version current:

  gcc (Debian 6.3.0-18) 6.3.0 20170516
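
For reference, a standalone userspace sketch of the variant under
discussion, assuming __fls() returns the index of the highest set bit as in
the kernel; the digit-by-digit loop is the existing lib/int_sqrt.c
computation, and the belt-and-suspenders loop is the one debated above.

    #include <limits.h>
    #include <stdio.h>

    static unsigned long my_fls(unsigned long x)
    {
            return (sizeof(unsigned long) * CHAR_BIT - 1) - __builtin_clzl(x);
    }

    static unsigned long int_sqrt(unsigned long x)
    {
            unsigned long b, m, y = 0;

            if (x <= 1)
                    return x;

            m = 1UL << (my_fls(x) & ~1UL);  /* highest power of 4 <= x */
            while (m > x)                   /* should never trigger */
                    m >>= 2;
            while (m != 0) {
                    b = y + m;
                    y >>= 1;
                    if (x >= b) {
                            x -= b;
                            y += m;
                    }
                    m >>= 2;
            }
            return y;
    }

    int main(void)
    {
            /* prints 0 5 1048576 */
            printf("%lu %lu %lu\n", int_sqrt(0), int_sqrt(26),
                   int_sqrt(1UL << 40));
            return 0;
    }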




Re: [PATCH] GFS2: fix code parameter error in inode_go_lock

2017-07-21 Thread Bob Peterson
- Original Message -
| In the inode_go_lock() function, the parameter order of list_add() is wrong.
| According to the definition of list_add(), the first parameter is the new
| entry and the second is the list head, so ip->i_trunc_list should be the
| first parameter and sdp->sd_trunc_list should be the second.
| 
| Signed-off-by: Wang Xibo
| Signed-off-by: Xiao Likun
| ---
|  fs/gfs2/glops.c | 2 +-
|  1 file changed, 1 insertion(+), 1 deletion(-)
| 
| diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
| index 5e69636..28c203a 100644
| --- a/fs/gfs2/glops.c
| +++ b/fs/gfs2/glops.c
| @@ -470,7 +470,7 @@ static int inode_go_lock(struct gfs2_holder *gh)
|   (gh->gh_state == LM_ST_EXCLUSIVE)) {
|   spin_lock(&sdp->sd_trunc_lock);
|   if (list_empty(&ip->i_trunc_list))
| - list_add(&sdp->sd_trunc_list, &ip->i_trunc_list);
| + list_add(&ip->i_trunc_list, &sdp->sd_trunc_list);
|   spin_unlock(&sdp->sd_trunc_lock);
|   wake_up(&sdp->sd_quota_wait);
|   return 1;
| --
| 1.8.3.1
| 
| 
| 
Hi,

Good catch! This is now applied to the for-next branch of the linux-gfs2 tree:
https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=e7cb550d79fba5876616ed33ccafafc4e2bd7e2e

Regards,

Bob Peterson
Red Hat File Systems
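
The list_add() contract, shown standalone (a cut-down, illustrative version
of the kernel's list helpers): the new entry comes first, the list head
second.

    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    #define LIST_HEAD_INIT(name) { &(name), &(name) }

    static void list_add(struct list_head *entry, struct list_head *head)
    {
            entry->next = head->next;
            entry->prev = head;
            head->next->prev = entry;
            head->next = entry;
    }

    int main(void)
    {
            struct list_head head = LIST_HEAD_INIT(head);
            struct list_head item;

            list_add(&item, &head);              /* new entry, then head */
            printf("%d\n", head.next == &item);  /* 1 */
            return 0;
    }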



Re: [PATCH v2 8/8] KVM: arm/arm64: register DEOI irq bypass consumer on ARM/ARM64

2017-07-21 Thread Christoffer Dall
On Thu, Jun 15, 2017 at 02:52:40PM +0200, Eric Auger wrote:
> This patch selects IRQ_BYPASS_MANAGER and HAVE_KVM_IRQ_BYPASS
> configs for ARM/ARM64.
> 
> kvm_arch_has_irq_bypass() is now implemented and returns true.
> As a consequence the irq bypass consumer will be registered for
> ARM/ARM64 with Direct EOI/IRQ forwarding callbacks:
> 
> - stop/start: halt/resume guest execution
> - add/del_producer: set/unset forwarding/DEOI at vgic/irqchip level
> 
> The consumer currently only works with a VFIO_PLATFORM producer.
> In other cases, start/stop do nothing and return without error
> to avoid outputting a spurious warning.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - check the type of the producer to avoid attachment with a VFIO-PCI
>   dummy MSI producer
> ---
>  arch/arm/kvm/Kconfig   |  3 +++
>  arch/arm64/kvm/Kconfig |  3 +++
>  virt/kvm/arm/arm.c | 48 
>  3 files changed, 54 insertions(+)
> 
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 90d0176..4e2b192 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -3,6 +3,7 @@
>  #
>  
>  source "virt/kvm/Kconfig"
> +source "virt/lib/Kconfig"
>  
>  menuconfig VIRTUALIZATION
>   bool "Virtualization"
> @@ -35,6 +36,8 @@ config KVM
>   select HAVE_KVM_IRQCHIP
>   select HAVE_KVM_IRQ_ROUTING
>   select HAVE_KVM_MSI
> + select IRQ_BYPASS_MANAGER
> + select HAVE_KVM_IRQ_BYPASS
>   depends on ARM_VIRT_EXT && ARM_LPAE && ARM_ARCH_TIMER
>   ---help---
> Support hosting virtualized guest machines.
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 52cb7ad..7e0d6e6 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -3,6 +3,7 @@
>  #
>  
>  source "virt/kvm/Kconfig"
> +source "virt/lib/Kconfig"
>  
>  menuconfig VIRTUALIZATION
>   bool "Virtualization"
> @@ -35,6 +36,8 @@ config KVM
>   select HAVE_KVM_MSI
>   select HAVE_KVM_IRQCHIP
>   select HAVE_KVM_IRQ_ROUTING
> + select IRQ_BYPASS_MANAGER
> + select HAVE_KVM_IRQ_BYPASS
>   ---help---
> Support hosting virtualized guest machines.
> We don't support KVM with 16K page tables yet, due to the multiple
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 3417e18..58871f8 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -27,6 +27,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  
> @@ -1420,6 +1422,52 @@ struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr)
>   return NULL;
>  }
>  
> +bool kvm_arch_has_irq_bypass(void)
> +{
> + return true;
> +}
> +
> +int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
> +   struct irq_bypass_producer *prod)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + if (prod->type != IRQ_BYPASS_VFIO_PLATFORM)
> + return 0;
> +
> + return kvm_vgic_set_forwarding(irqfd->kvm, prod->irq,
> +irqfd->gsi + VGIC_NR_PRIVATE_IRQS);
> +}
> +void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
> +   struct irq_bypass_producer *prod)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + if (prod->type != IRQ_BYPASS_VFIO_PLATFORM)
> + return;
> +
> + kvm_vgic_unset_forwarding(irqfd->kvm, prod->irq,
> +   irqfd->gsi + VGIC_NR_PRIVATE_IRQS);
> +}
> +
> +void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *cons)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + kvm_arm_halt_guest(irqfd->kvm);
> +}
> +
> +void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *cons)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + kvm_arm_resume_guest(irqfd->kvm);
> +}
> +
>  /**
>   * Initialize Hyp-mode and memory mappings on all CPUs.
>   */
> -- 
> 2.5.5

Nit: I'm wondering if this should go in a different file, like
virt/kvm/arm/irqbypass.c, or virt/kvm/arm/vgic/vgic-irqbypass.c ?

Thanks,
-Christoffer


Re: [PATCH v2 8/8] KVM: arm/arm64: register DEOI irq bypass consumer on ARM/ARM64

2017-07-21 Thread Christoffer Dall
On Thu, Jun 15, 2017 at 02:52:40PM +0200, Eric Auger wrote:
> This patch selects IRQ_BYPASS_MANAGER and HAVE_KVM_IRQ_BYPASS
> configs for ARM/ARM64.
> 
> kvm_arch_has_irq_bypass() now is implemented and returns true.
> As a consequence the irq bypass consumer will be registered for
> ARM/ARM64 with Direct EOI/IRQ forwarding callbacks:
> 
> - stop/start: halt/resume guest execution
> - add/del_producer: set/unset forwarding/DEOI at vgic/irqchip level
> 
> The consumer currently only works with a VFIO_PLATFORM producer.
> In other cases, start/stop do nothing and return without error
> to avoid outputting a spurious warning.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - check the type of the producer to avoid attachement with VFIO-PCI
>   dummy MSI producer
> ---
>  arch/arm/kvm/Kconfig   |  3 +++
>  arch/arm64/kvm/Kconfig |  3 +++
>  virt/kvm/arm/arm.c | 48 
>  3 files changed, 54 insertions(+)
> 
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 90d0176..4e2b192 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -3,6 +3,7 @@
>  #
>  
>  source "virt/kvm/Kconfig"
> +source "virt/lib/Kconfig"
>  
>  menuconfig VIRTUALIZATION
>   bool "Virtualization"
> @@ -35,6 +36,8 @@ config KVM
>   select HAVE_KVM_IRQCHIP
>   select HAVE_KVM_IRQ_ROUTING
>   select HAVE_KVM_MSI
> + select IRQ_BYPASS_MANAGER
> + select HAVE_KVM_IRQ_BYPASS
>   depends on ARM_VIRT_EXT && ARM_LPAE && ARM_ARCH_TIMER
>   ---help---
> Support hosting virtualized guest machines.
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 52cb7ad..7e0d6e6 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -3,6 +3,7 @@
>  #
>  
>  source "virt/kvm/Kconfig"
> +source "virt/lib/Kconfig"
>  
>  menuconfig VIRTUALIZATION
>   bool "Virtualization"
> @@ -35,6 +36,8 @@ config KVM
>   select HAVE_KVM_MSI
>   select HAVE_KVM_IRQCHIP
>   select HAVE_KVM_IRQ_ROUTING
> + select IRQ_BYPASS_MANAGER
> + select HAVE_KVM_IRQ_BYPASS
>   ---help---
> Support hosting virtualized guest machines.
> We don't support KVM with 16K page tables yet, due to the multiple
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 3417e18..58871f8 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -27,6 +27,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  
> @@ -1420,6 +1422,52 @@ struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, 
> unsigned long mpidr)
>   return NULL;
>  }
>  
> +bool kvm_arch_has_irq_bypass(void)
> +{
> + return true;
> +}
> +
> +int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
> +   struct irq_bypass_producer *prod)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + if (prod->type != IRQ_BYPASS_VFIO_PLATFORM)
> + return 0;
> +
> + return kvm_vgic_set_forwarding(irqfd->kvm, prod->irq,
> +irqfd->gsi + VGIC_NR_PRIVATE_IRQS);
> +}
> +void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
> +   struct irq_bypass_producer *prod)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + if (prod->type != IRQ_BYPASS_VFIO_PLATFORM)
> + return;
> +
> + kvm_vgic_unset_forwarding(irqfd->kvm, prod->irq,
> +   irqfd->gsi + VGIC_NR_PRIVATE_IRQS);
> +}
> +
> +void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *cons)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + kvm_arm_halt_guest(irqfd->kvm);
> +}
> +
> +void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *cons)
> +{
> + struct kvm_kernel_irqfd *irqfd =
> + container_of(cons, struct kvm_kernel_irqfd, consumer);
> +
> + kvm_arm_resume_guest(irqfd->kvm);
> +}
> +
>  /**
>   * Initialize Hyp-mode and memory mappings on all CPUs.
>   */
> -- 
> 2.5.5

Nit: I'm wondering if this should go in a different file, like
virt/kvm/arm/irqbypass.c, or virt/kvm/arm/vgic/vgic-irqbypass.c?

Thanks,
-Christoffer
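
For context, a minimal sketch of how the generic irqfd code wires these
arch callbacks into the bypass manager (modeled on kvm_irqfd_assign() in
virt/kvm/eventfd.c; the helper name irqfd_setup_bypass is made up for
illustration):

#include <linux/irqbypass.h>
#include <linux/kvm_irqfd.h>

/* Hypothetical helper mirroring the registration step in kvm_irqfd_assign() */
static void irqfd_setup_bypass(struct kvm_kernel_irqfd *irqfd)
{
	if (!kvm_arch_has_irq_bypass())
		return;

	irqfd->consumer.token = (void *)irqfd->eventfd;
	irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
	irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
	irqfd->consumer.stop = kvm_arch_irq_bypass_stop;
	irqfd->consumer.start = kvm_arch_irq_bypass_start;

	/* the manager matches producers and consumers by token */
	if (irq_bypass_register_consumer(&irqfd->consumer))
		pr_info("irq bypass consumer (token %p) registration failed\n",
			irqfd->consumer.token);
}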


Re: [PATCH v5 0/2] add UniPhier thermal support

2017-07-21 Thread Masahiro Yamada
2017-07-21 20:21 GMT+09:00 Kunihiko Hayashi :
> This series adds support for CPU temperature monitor modules implemented
> on UniPhier LD20 and PXs2 SoCs. This driver supports temperature monitoring
> and alert function on the module.
>
> Changes in v4:
> - fix warnings from sparse by replacing u32 with __be32

Nit.

Your subject prefix indicates this patch is v5.

Do you mean "Changes in v5" or "Changes since v4"?






> Changes in v3:
> - remove TMOD_MASK and use TMOD_WIDTH representing the bit width of TMOD
>
> Changes in v2:
> - add nsleep after starting and stopping PVT
> - replace temperature calculation with sign_extend32()
>
> Changes in v1:
> - separate dts from this patchset as another patchset
> - remove 'reg' description on the dt-bindings document
> - fix the order of calling initialization functions
> - replace mask bits to use GENMASK
> - fix calculation of temperature because of not considering a negative value
> - use devm_request_threaded_irq() instead of devm_request_irq() and
>   separate a thread function from the interrupt handler
> - add dependency to Kconfig
> - set 120C to CRITICAL_TEMP_LIMIT as maximum temperature
> - shrink each line of parameters to save the number of lines
> - improve some comments and copyright description
>
> Kunihiko Hayashi (2):
>   dt-bindings: thermal: add binding documentation for UniPhier thermal
> monitor
>   thermal: uniphier: add UniPhier thermal driver
>
>  .../bindings/thermal/uniphier-thermal.txt  |  64 
>  drivers/thermal/Kconfig|   8 +
>  drivers/thermal/Makefile   |   1 +
>  drivers/thermal/uniphier_thermal.c | 386 
> +
>  4 files changed, 459 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/thermal/uniphier-thermal.txt
>  create mode 100644 drivers/thermal/uniphier_thermal.c
>
> --
> 2.7.4
>



-- 
Best Regards
Masahiro Yamada


Re: [PATCH] Revert "x86/uaccess: Add stack frame output operand in get_user() inline asm"

2017-07-21 Thread Josh Poimboeuf
On Fri, Jul 21, 2017 at 12:13:31PM +0300, Andrey Ryabinin wrote:
> > Still, unfortunately, I don't think that's going to work for GCC.
> > Changing the '__sp' register variable to global in the header file
> > causes it to make a *bunch* of changes across the kernel, even in
> > functions which don't do inline asm.  It seems to be disabling some
> > optimizations across the board.
> 
> All I see is just bunch of reordering of independent instructions, like this:
> 
> -81012760:  5b            pop    %rbx
> -81012761:  31 c0         xor    %eax,%eax
> +81012760:  31 c0         xor    %eax,%eax
> +81012762:  5b            pop    %rbx
> 
> -810c29ae:  48 83 c4 28   add    $0x28,%rsp
> -810c29b2:  89 d8         mov    %ebx,%eax
> +810c29ae:  89 d8         mov    %ebx,%eax
> +810c29b0:  48 83 c4 28   add    $0x28,%rsp
> 
> I haven't noticed any single bad/harmful change. The size of .text remained 
> the same. 

I compiled with -ffunction-sections to make the comparisons easier.  The
reordering is much more extreme than your example.  (This is with GCC 7,
btw).  And it's not just reordering of instructions.  It's control flow
changes as well.

Also, the text size grew a little:

      text     data       bss       dec      hex  filename
  10630602  8295074  16461824  35387500  21bf86c  vmlinux.before
  10634013  8295074  16461824  35390911  21c05bf  vmlinux.after

A small two-line change, which is supposed to be a noop, or at least
should only affect a small number of functions, but which instead
affects optimization decisions across the entire kernel, is actively
harmful IMO.

> And btw, arm/arm64 already use global current_stack_pointer just fine.

I wonder if they looked for the impact.
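
For reference, the arm/arm64 variant mentioned above is a global register
variable; a condensed form of the arm definition (see
arch/arm/include/asm/thread_info.h) looks like this, and an x86 version
would differ only in the register name:

/* reserve "sp" globally so GCC never allocates it for anything else */
register unsigned long current_stack_pointer asm ("sp");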

-- 
Josh


[RFC PATCH 4/9] housekeeping: Make housekeeping cpumask private

2017-07-21 Thread Frederic Weisbecker
Nobody needs to access this detail. housekeeping_cpumask() already
takes care of it.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h | 31 ++-
 kernel/housekeeping.c| 33 -
 2 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index 64d0ee5..31a1401 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -6,46 +6,35 @@
 #include 
 
 #ifdef CONFIG_NO_HZ_FULL
-extern cpumask_var_t housekeeping_mask;
+extern int housekeeping_any_cpu(void);
+extern const struct cpumask *housekeeping_cpumask(void);
+extern void housekeeping_affine(struct task_struct *t);
+extern bool housekeeping_test_cpu(int cpu);
 extern void __init housekeeping_init(void);
+
 #else
-static inline void housekeeping_init(void) { }
-#endif /* CONFIG_NO_HZ_FULL */
 
 static inline int housekeeping_any_cpu(void)
 {
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
-#endif
return smp_processor_id();
 }
 
 static inline const struct cpumask *housekeeping_cpumask(void)
 {
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   return housekeeping_mask;
-#endif
return cpu_possible_mask;
 }
 
+static inline void housekeeping_affine(struct task_struct *t) { }
+static inline void housekeeping_init(void) { }
+#endif /* CONFIG_NO_HZ_FULL */
+
 static inline bool is_housekeeping_cpu(int cpu)
 {
 #ifdef CONFIG_NO_HZ_FULL
if (tick_nohz_full_enabled())
-   return cpumask_test_cpu(cpu, housekeeping_mask);
+   return housekeeping_test_cpu(cpu);
 #endif
return true;
 }
 
-static inline void housekeeping_affine(struct task_struct *t)
-{
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   set_cpus_allowed_ptr(t, housekeeping_mask);
-
-#endif
-}
-
 #endif /* _LINUX_HOUSEKEEPING_H */
diff --git a/kernel/housekeeping.c b/kernel/housekeeping.c
index 6d8afd5..0183e75 100644
--- a/kernel/housekeeping.c
+++ b/kernel/housekeeping.c
@@ -10,7 +10,38 @@
 #include 
 #include 
 
-cpumask_var_t housekeeping_mask;
+static cpumask_var_t housekeeping_mask;
+
+int housekeeping_any_cpu(void)
+{
+   if (tick_nohz_full_enabled())
+   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+
+   return smp_processor_id();
+}
+
+const struct cpumask *housekeeping_cpumask(void)
+{
+   if (tick_nohz_full_enabled())
+   return housekeeping_mask;
+
+   return cpu_possible_mask;
+}
+
+void housekeeping_affine(struct task_struct *t)
+{
+   if (tick_nohz_full_enabled())
+   set_cpus_allowed_ptr(t, housekeeping_mask);
+}
+
+bool housekeeping_test_cpu(int cpu)
+{
+   if (tick_nohz_full_enabled())
+   return cpumask_test_cpu(cpu, housekeeping_mask);
+
+   return true;
+}
+
 
 void __init housekeeping_init(void)
 {
-- 
2.7.4
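
After this change, callers go through the accessors only. As a sketch of
typical usage (modeled on rcu_bind_gp_kthread(), which already calls
housekeeping_affine(current); the kthread function below is made up):

#include <linux/housekeeping.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int my_kthread_fn(void *unused)
{
	/* pin to the housekeeping set; a no-op unless nohz_full is enabled */
	housekeeping_affine(current);

	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);
	return 0;
}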



Re: [PATCH v5 2/2] thermal: uniphier: add UniPhier thermal driver

2017-07-21 Thread Masahiro Yamada
2017-07-21 20:21 GMT+09:00 Kunihiko Hayashi :

> +static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev)
> +{
> +   struct regmap *map = tdev->regmap;
> +   u32 val;
> +   int ret;
> +
> +   /* stop PVT */
> +   regmap_write_bits(map, tdev->data->block_base + PVTCTLEN,
> + PVTCTLEN_EN, 0);
> +
> +   /*
> +* set default value if missing calibrated value
> +*
> +* Since SoC has a calibrated value that was set in advance,
> +* TMODCOEF shows non-zero and PVT refers the value internally.
> +*
> +* However, some boards don't have the calibrated value.
> +* In that case, TMODCOEF shows zero and the driver has to set
> +* default value manually.
> +*/
> +   ret = regmap_read(map, tdev->data->map_base + TMODCOEF, &val);
> +   if (ret)
> +   return ret;
> +   if (!val)
> +   regmap_write(map, tdev->data->tmod_setup_addr,
> +           TMODSETUP0_EN | TMODSETUP0_VAL(tdev->tmod_calib0) |
> +           TMODSETUP1_EN | TMODSETUP1_VAL(tdev->tmod_calib1));


This code is strange.

What if TMODCOEF has no calibrated value and
"socionext,tmod-calibration" is not set either?


->tmod_setup_addr will be set to zero
and the sensor would not work, right?




> +   /* get tmod-calibration values */
> +   calib = of_get_property(dev->of_node, "socionext,tmod-calibration",
> +   NULL);
> +   if (calib) {
> +   tdev->tmod_calib0 = of_read_number(calib, 1);
> +   tdev->tmod_calib1 = of_read_number(calib + 1, 1);
> +   }


From your DT change (https://patchwork.kernel.org/patch/9826391/),
this property seems a pair of u32 values, like follows:

socionext,tmod-calibration = <0x0f22 0x68ee>;


Why do you need to use of_read_number() to retrieve each u32 value?

See the comment and return value (u64):

---- >8 ----
/* Helper to read a big number; size is in cells (not bytes) */
static inline u64 of_read_number(const __be32 *cell, int size)
---- 8< ----


Also, you are not checking the length of the property.
of_read_number() may over-run the property.



of_property_read_u32_array() will be a better choice.


I'd propose the code like follows:





diff --git a/drivers/thermal/uniphier_thermal.c
b/drivers/thermal/uniphier_thermal.c
index 1a77c5bf6930..eea8a3053584 100644
--- a/drivers/thermal/uniphier_thermal.c
+++ b/drivers/thermal/uniphier_thermal.c
@@ -94,16 +94,16 @@ struct uniphier_tm_soc_data {
 struct uniphier_tm_dev {
struct regmap *regmap;
bool alert_en[ALERT_CH_NUM];
-   u32 tmod_calib0;
-   u32 tmod_calib1;
struct thermal_zone_device *tz_dev;
const struct uniphier_tm_soc_data *data;
 };

-static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev)
+static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev,
+					 struct device *dev)
 {
struct regmap *map = tdev->regmap;
u32 val;
+   u32 tmod_val[2];
int ret;

/* stop PVT */
@@ -123,10 +123,22 @@ static int uniphier_tm_initialize_sensor(struct
uniphier_tm_dev *tdev)
ret = regmap_read(map, tdev->data->map_base + TMODCOEF, &val);
if (ret)
return ret;
-   if (!val)
+   if (!val) {
+   /*
+* No preset value found in the register.
+* Look for the fall-back values in DT.
+*/
+   ret = of_property_read_u32_array(dev->of_node,
+                                    "socionext,tmod-calibration",
+                                    tmod_val, ARRAY_SIZE(tmod_val));
+   if (ret) {
+           dev_err(dev, "neither register nor DT has calibrated values.\n");
+   return ret;
+   }
regmap_write(map, tdev->data->tmod_setup_addr,
-   TMODSETUP0_EN | TMODSETUP0_VAL(tdev->tmod_calib0) |
-   TMODSETUP1_EN | TMODSETUP1_VAL(tdev->tmod_calib1));
+            TMODSETUP0_EN | TMODSETUP0_VAL(tmod_val[0]) |
+            TMODSETUP1_EN | TMODSETUP1_VAL(tmod_val[1]));
+   }

/* select temperature mode */
regmap_write_bits(map, tdev->data->block_base + PVTCTLMODE,
@@ -254,7 +266,6 @@ static int uniphier_tm_probe(struct platform_device *pdev)
struct device_node *parent;
struct uniphier_tm_dev *tdev;
const struct thermal_trip *trips;
-   const __be32 *calib;
int i, ret, irq, ntrips, crit_temp = INT_MAX;

tdev = devm_kzalloc(dev, sizeof(*tdev), GFP_KERNEL);
@@ -269,14 +280,6 @@ static int uniphier_tm_probe(struct platform_device *pdev)
if (irq < 0)
return irq;

-   /* get tmod-calibration 

[RFC PATCH 2/9] watchdog: Use housekeeping_cpumask() instead of ad-hoc version

2017-07-21 Thread Frederic Weisbecker
While trying to disable the watchdog on nohz_full CPUs, the watchdog
implements an ad-hoc version of housekeeping_cpumask(). Let's replace
those re-invented lines.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 kernel/watchdog.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7a9df162..cdd0d11 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -941,15 +941,10 @@ void __init lockup_detector_init(void)
 {
set_sample_period();
 
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled()) {
+   if (tick_nohz_full_enabled())
pr_info("Disabling watchdog on nohz_full cores by default\n");
-   cpumask_copy(&watchdog_cpumask, housekeeping_mask);
-   } else
-   cpumask_copy(&watchdog_cpumask, cpu_possible_mask);
-#else
-   cpumask_copy(&watchdog_cpumask, cpu_possible_mask);
-#endif
+
+   cpumask_copy(&watchdog_cpumask, housekeeping_cpumask());
 
if (watchdog_enabled)
watchdog_enable_all_cpus();
-- 
2.7.4



[RFC PATCH 6/9] housekeeping: Rename is_housekeeping_cpu to housekeeping_cpu

2017-07-21 Thread Frederic Weisbecker
To keep a proper housekeeping namespace.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h | 2 +-
 kernel/sched/core.c  | 6 +++---
 kernel/sched/fair.c  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index cbe8d63..320cc2b 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -29,7 +29,7 @@ static inline void housekeeping_affine(struct task_struct *t) 
{ }
 static inline void housekeeping_init(void) { }
 #endif /* CONFIG_NO_HZ_FULL */
 
-static inline bool is_housekeeping_cpu(int cpu)
+static inline bool housekeeping_cpu(int cpu)
 {
 #ifdef CONFIG_NO_HZ_FULL
if (static_branch_unlikely(&housekeeping_overriden))
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 751c040..adfee7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -527,7 +527,7 @@ int get_nohz_timer_target(void)
int i, cpu = smp_processor_id();
struct sched_domain *sd;
 
-   if (!idle_cpu(cpu) && is_housekeeping_cpu(cpu))
+   if (!idle_cpu(cpu) && housekeeping_cpu(cpu))
return cpu;
 
rcu_read_lock();
@@ -536,14 +536,14 @@ int get_nohz_timer_target(void)
if (cpu == i)
continue;
 
-   if (!idle_cpu(i) && is_housekeeping_cpu(i)) {
+   if (!idle_cpu(i) && housekeeping_cpu(i)) {
cpu = i;
goto unlock;
}
}
}
 
-   if (!is_housekeeping_cpu(cpu))
+   if (!housekeeping_cpu(cpu))
cpu = housekeeping_any_cpu();
 unlock:
rcu_read_unlock();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05dfced..8ecd204 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8600,7 +8600,7 @@ void nohz_balance_enter_idle(int cpu)
return;
 
/* Spare idle load balancing on CPUs that don't want to be disturbed: */
-   if (!is_housekeeping_cpu(cpu))
+   if (!housekeeping_cpu(cpu))
return;
 
if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
-- 
2.7.4



[RFC PATCH 8/9] housekeeping: Move it under own config, independant from NO_HZ

2017-07-21 Thread Frederic Weisbecker
Complete the housekeeping split from CONFIG_NO_HZ_FULL by moving it
under its own config. This way we finally separate the isolation code
from nohz.
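
With this, enabling the feature becomes an explicit .config choice
instead of a side effect of NO_HZ_FULL; an illustrative fragment of the
resulting configuration:

	CONFIG_HOUSEKEEPING=y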

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h | 6 +++---
 init/Kconfig | 6 ++
 kernel/Makefile  | 2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index ba769c8..d79a665 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -5,7 +5,7 @@
 #include 
 #include 
 
-#ifdef CONFIG_NO_HZ_FULL
+#ifdef CONFIG_HOUSEKEEPING
 DECLARE_STATIC_KEY_FALSE(housekeeping_overriden);
 extern int housekeeping_any_cpu(void);
 extern const struct cpumask *housekeeping_cpumask(void);
@@ -25,11 +25,11 @@ static inline const struct cpumask 
*housekeeping_cpumask(void)
 }
 
 static inline void housekeeping_affine(struct task_struct *t) { }
-#endif /* CONFIG_NO_HZ_FULL */
+#endif /* CONFIG_HOUSEKEEPING */
 
 static inline bool housekeeping_cpu(int cpu)
 {
-#ifdef CONFIG_NO_HZ_FULL
+#ifdef CONFIG_HOUSEKEEPING
if (static_branch_unlikely(&housekeeping_overriden))
return housekeeping_test_cpu(cpu);
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25..cc7af32 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -472,6 +472,12 @@ config TASK_IO_ACCOUNTING
 
 endmenu # "CPU/Task time and stats accounting"
 
+config HOUSEKEEPING
+   bool "Housekeeping work tuning"
+   help
+ Allow to affine and offload kernel internal routine jobs that are
+ usually executed on any CPU.
+
 source "kernel/rcu/Kconfig"
 
 config BUILD_BIN2C
diff --git a/kernel/Makefile b/kernel/Makefile
index 8a85c4b..58be459 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,7 +109,7 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
-obj-$(CONFIG_NO_HZ_FULL) += housekeeping.o
+obj-$(CONFIG_HOUSEKEEPING) += housekeeping.o
 
 obj-$(CONFIG_HAS_IOMEM) += memremap.o
 
-- 
2.7.4



[RFC PATCH 1/9] housekeeping: Move housekeeping related code to its own file

2017-07-21 Thread Frederic Weisbecker
The housekeeping code is currently tied to the nohz code. As we are
planning to make housekeeping independent from it, start by moving
the relevant code to its own file.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 drivers/net/ethernet/tile/tilegx.c |  2 +-
 include/linux/housekeeping.h   | 56 ++
 include/linux/tick.h   | 37 -
 init/main.c|  2 ++
 kernel/Makefile|  1 +
 kernel/housekeeping.c  | 32 ++
 kernel/rcu/tree_plugin.h   |  1 +
 kernel/rcu/update.c|  1 +
 kernel/sched/core.c|  1 +
 kernel/sched/fair.c|  1 +
 kernel/time/tick-sched.c   | 18 
 kernel/watchdog.c  |  1 +
 12 files changed, 97 insertions(+), 56 deletions(-)
 create mode 100644 include/linux/housekeeping.h
 create mode 100644 kernel/housekeeping.c

diff --git a/drivers/net/ethernet/tile/tilegx.c 
b/drivers/net/ethernet/tile/tilegx.c
index aec9538..eb74e09 100644
--- a/drivers/net/ethernet/tile/tilegx.c
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -40,7 +40,7 @@
 #include 
 #include 
 #include 
-#include <linux/tick.h>
+#include <linux/housekeeping.h>
 
 #include 
 #include 
diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
new file mode 100644
index 000..3d6a8e6
--- /dev/null
+++ b/include/linux/housekeeping.h
@@ -0,0 +1,56 @@
+#ifndef _LINUX_HOUSEKEEPING_H
+#define _LINUX_HOUSEKEEPING_H
+
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_NO_HZ_FULL
+extern cpumask_var_t housekeeping_mask;
+
+static inline int housekeeping_any_cpu(void)
+{
+   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+}
+
+extern void __init housekeeping_init(void);
+
+#else
+
+static inline int housekeeping_any_cpu(void)
+{
+   return smp_processor_id();
+}
+
+static inline void housekeeping_init(void) { }
+#endif /* CONFIG_NO_HZ_FULL */
+
+
+static inline const struct cpumask *housekeeping_cpumask(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_enabled())
+   return housekeeping_mask;
+#endif
+   return cpu_possible_mask;
+}
+
+static inline bool is_housekeeping_cpu(int cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_enabled())
+   return cpumask_test_cpu(cpu, housekeeping_mask);
+#endif
+   return true;
+}
+
+static inline void housekeeping_affine(struct task_struct *t)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_enabled())
+   set_cpus_allowed_ptr(t, housekeeping_mask);
+
+#endif
+}
+
+#endif /* _LINUX_HOUSEKEEPING_H */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68..68afc09 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -137,7 +137,6 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 
*unused) { return -1; }
 #ifdef CONFIG_NO_HZ_FULL
 extern bool tick_nohz_full_running;
 extern cpumask_var_t tick_nohz_full_mask;
-extern cpumask_var_t housekeeping_mask;
 
 static inline bool tick_nohz_full_enabled(void)
 {
@@ -161,11 +160,6 @@ static inline void tick_nohz_full_add_cpus_to(struct 
cpumask *mask)
cpumask_or(mask, mask, tick_nohz_full_mask);
 }
 
-static inline int housekeeping_any_cpu(void)
-{
-   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
-}
-
 extern void tick_nohz_dep_set(enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear(enum tick_dep_bits bit);
 extern void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit);
@@ -235,10 +229,6 @@ static inline void tick_dep_clear_signal(struct 
signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 #else
-static inline int housekeeping_any_cpu(void)
-{
-   return smp_processor_id();
-}
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
 static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
@@ -260,33 +250,6 @@ static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
 #endif
 
-static inline const struct cpumask *housekeeping_cpumask(void)
-{
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   return housekeeping_mask;
-#endif
-   return cpu_possible_mask;
-}
-
-static inline bool is_housekeeping_cpu(int cpu)
-{
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   return cpumask_test_cpu(cpu, housekeeping_mask);
-#endif
-   return true;
-}
-
-static inline void housekeeping_affine(struct task_struct *t)
-{
-#ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
-   set_cpus_allowed_ptr(t, housekeeping_mask);
-

[RFC PATCH 7/9] housekeeping: Use own boot option, independant from nohz

2017-07-21 Thread Frederic Weisbecker
Housekeeping is currently driven by nohz_full: any CPU that is not in
the nohz_full range is considered a housekeeper. This is a design
mistake, because nohz is just one detail among all the existing
isolation features. Nohz shouldn't imply anything other than
tick-related things.

Rather, we want to drive all the isolation features from the housekeeping
subsystem which is responsible for all the work that can be either
affined (unpinned workqueues, timers, kthreads, ...) or offloaded
(scheduler tick, ...).

Let's start with a boot option to define the housekeepers. We should be
able to further enhance that through cpusets.
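
As an illustration (the CPU list here is hypothetical), booting with:

	housekeeping=0-1

goes through cpulist_parse() and confines the affinable kernel work to
CPUs 0 and 1, leaving the remaining CPUs isolated.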

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h |  2 --
 init/main.c  |  2 --
 kernel/housekeeping.c| 22 ++
 3 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index 320cc2b..ba769c8 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -11,7 +11,6 @@ extern int housekeeping_any_cpu(void);
 extern const struct cpumask *housekeeping_cpumask(void);
 extern void housekeeping_affine(struct task_struct *t);
 extern bool housekeeping_test_cpu(int cpu);
-extern void __init housekeeping_init(void);
 
 #else
 
@@ -26,7 +25,6 @@ static inline const struct cpumask *housekeeping_cpumask(void)
 }
 
 static inline void housekeeping_affine(struct task_struct *t) { }
-static inline void housekeeping_init(void) { }
 #endif /* CONFIG_NO_HZ_FULL */
 
 static inline bool housekeeping_cpu(int cpu)
diff --git a/init/main.c b/init/main.c
index 9904a1e..9789ab7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -46,7 +46,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -608,7 +607,6 @@ asmlinkage __visible void __init start_kernel(void)
early_irq_init();
init_IRQ();
tick_init();
-   housekeeping_init();
rcu_init_nohz();
init_timers();
hrtimers_init();
diff --git a/kernel/housekeeping.c b/kernel/housekeeping.c
index f8be7e6..a54765d 100644
--- a/kernel/housekeeping.c
+++ b/kernel/housekeeping.c
@@ -45,23 +45,21 @@ bool housekeeping_test_cpu(int cpu)
return true;
 }
 
-void __init housekeeping_init(void)
+/* Parse the boot-time housekeeping CPU list from the kernel parameters. */
+static int __init housekeeping_setup(char *str)
 {
-   if (!tick_nohz_full_enabled())
-   return;
-
-   if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
-   WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
-   cpumask_clear(tick_nohz_full_mask);
-   tick_nohz_full_running = false;
-   return;
+   alloc_bootmem_cpumask_var(&housekeeping_mask);
+   if (cpulist_parse(str, housekeeping_mask) < 0) {
+   pr_warn("Housekeeping: Incorrect cpumask\n");
+   free_bootmem_cpumask_var(housekeeping_mask);
+   return 1;
}
 
-   cpumask_andnot(housekeeping_mask,
-  cpu_possible_mask, tick_nohz_full_mask);
-
static_branch_enable(&housekeeping_overriden);
 
/* We need at least one CPU to handle housekeeping work */
WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
+
+   return 1;
 }
+__setup("housekeeping=", housekeeping_setup);
-- 
2.7.4



[RFC PATCH 0/9] Introduce housekeeping subsystem

2017-07-21 Thread Frederic Weisbecker
I'm leaving for two weeks so this is food for thought in the meantime :)

We have a design issue with nohz_full: it drives the isolation features
through the *housekeeping*() functions: kthreads, unpinned timers,
watchdog, ...

But things should work the other way around because the tick is just an
isolation feature among others.

So we need a housekeeping subsystem to drive all these isolation
features, including nohz full in a later iteration. For now this is a
basic draft. In the long run this subsystem should also drive the tick
offloading (remove residual 1Hz) and all unbound kthreads.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/0hz

HEAD: 68e3af1de5db228bf6c2a5e721bce59a02cfc4e1

Thanks,
Frederic
---

Frederic Weisbecker (9):
  housekeeping: Move housekeeping related code to its own file
  watchdog: Use housekeeping_cpumask() instead of ad-hoc version
  housekeeping: Provide a dynamic off-case to housekeeping_any_cpu()
  housekeeping: Make housekeeping cpumask private
  housekeeping: Use its own static key
  housekeeping: Rename is_housekeeping_cpu to housekeeping_cpu
  housekeeping: Use own boot option, independant from nohz
  housekeeping: Move it under own config, independant from NO_HZ
  workqueue: Affine unbound workqueues to housekeeping cpumask


 drivers/net/ethernet/tile/tilegx.c |  2 +-
 include/linux/housekeeping.h   | 39 +++
 include/linux/tick.h   | 37 --
 init/Kconfig   |  6 
 kernel/Makefile|  1 +
 kernel/housekeeping.c  | 65 ++
 kernel/rcu/tree_plugin.h   |  1 +
 kernel/rcu/update.c|  1 +
 kernel/sched/core.c|  7 ++--
 kernel/sched/fair.c|  3 +-
 kernel/time/tick-sched.c   | 18 ---
 kernel/watchdog.c  | 12 +++
 kernel/workqueue.c |  3 +-
 13 files changed, 126 insertions(+), 69 deletions(-)


[RFC PATCH 9/9] workqueue: Affine unbound workqueues to housekeeping cpumask

2017-07-21 Thread Frederic Weisbecker
Although the unbound workqueue cpumask can be overridden through sysfs,
the housekeeping cpumask, which drives the CPU isolation, provides a
relevant default value.
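
The sysfs override mentioned above is the unbound-workqueue cpumask
attribute; for example (mask value hypothetical):

	echo 0-3 > /sys/devices/virtual/workqueue/cpumask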

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 kernel/workqueue.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a86688f..4303c06 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include <linux/housekeeping.h>
 
 #include "workqueue_internal.h"
 
@@ -5524,7 +5525,7 @@ int __init workqueue_init_early(void)
WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
-   cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
+   cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask());
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-- 
2.7.4



[RFC PATCH 3/9] housekeeping: Provide a dynamic off-case to housekeeping_any_cpu()

2017-07-21 Thread Frederic Weisbecker
housekeeping_any_cpu() doesn't correctly handle the case where
CONFIG_NO_HZ_FULL=y and no CPU is in nohz_full mode. So far no caller
needs this, but let's prepare to avoid any future surprise.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h | 21 -
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index 3d6a8e6..64d0ee5 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -7,24 +7,19 @@
 
 #ifdef CONFIG_NO_HZ_FULL
 extern cpumask_var_t housekeeping_mask;
-
-static inline int housekeeping_any_cpu(void)
-{
-   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
-}
-
 extern void __init housekeeping_init(void);
-
 #else
-
-static inline int housekeeping_any_cpu(void)
-{
-   return smp_processor_id();
-}
-
 static inline void housekeeping_init(void) { }
 #endif /* CONFIG_NO_HZ_FULL */
 
+static inline int housekeeping_any_cpu(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_enabled())
+   return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+#endif
+   return smp_processor_id();
+}
 
 static inline const struct cpumask *housekeeping_cpumask(void)
 {
-- 
2.7.4



[RFC PATCH 0/9] Introduce housekeeping subsystem

2017-07-21 Thread Frederic Weisbecker
I'm leaving for two weeks so this is food for thoughts in the meantime :)

We have a design issue with nohz_full: it drives the isolation features
through the *housekeeping*() functions: kthreads, unpinned timers,
watchdog, ...

But things should work the other way around because the tick is just an
isolation feature among others.

So we need a housekeeping subsystem to drive all these isolation
features, including nohz full in a later iteration. For now this is a
basic draft. In the long run this subsystem should also drive the tick
offloading (remove residual 1Hz) and all unbound kthreads.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/0hz

HEAD: 68e3af1de5db228bf6c2a5e721bce59a02cfc4e1

Thanks,
Frederic
---

Frederic Weisbecker (9):
  housekeeping: Move housekeeping related code to its own file
  watchdog: Use housekeeping_cpumask() instead of ad-hoc version
  housekeeping: Provide a dynamic off-case to housekeeping_any_cpu()
  housekeeping: Make housekeeping cpumask private
  housekeeping: Use its own static key
  housekeeping: Rename is_housekeeping_cpu to housekeeping_cpu
  housekeeping: Use own boot option, independant from nohz
  housekeeping: Move it under own config, independant from NO_HZ
  workqueue: Affine unbound workqueues to housekeeping cpumask


 drivers/net/ethernet/tile/tilegx.c |  2 +-
 include/linux/housekeeping.h   | 39 +++
 include/linux/tick.h   | 37 --
 init/Kconfig   |  6 
 kernel/Makefile|  1 +
 kernel/housekeeping.c  | 65 ++
 kernel/rcu/tree_plugin.h   |  1 +
 kernel/rcu/update.c|  1 +
 kernel/sched/core.c|  7 ++--
 kernel/sched/fair.c|  3 +-
 kernel/time/tick-sched.c   | 18 ---
 kernel/watchdog.c  | 12 +++
 kernel/workqueue.c |  3 +-
 13 files changed, 126 insertions(+), 69 deletions(-)


[RFC PATCH 9/9] workqueue: Affine unbound workqueues to housekeeping cpumask

2017-07-21 Thread Frederic Weisbecker
Although the unbound workqueue cpumask can be overriden through sysfs,
the housekeeping cpumask which drives the CPU isolation provides a
relevant default value.

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 kernel/workqueue.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a86688f..4303c06 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include <linux/housekeeping.h>
 
 #include "workqueue_internal.h"
 
@@ -5524,7 +5525,7 @@ int __init workqueue_init_early(void)
WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
-   cpumask_copy(wq_unbound_cpumask, cpu_possible_mask);
+   cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask());
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-- 
2.7.4




[RFC PATCH 5/9] housekeeping: Use its own static key

2017-07-21 Thread Frederic Weisbecker
Housekeeping code still depends on the nohz_full static key. Since we want
to decouple housekeeping from nohz, let's create housekeeping's own static
key. It's mostly relevant for calls to is_housekeeping_cpu() from the
scheduler.
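
For readers unfamiliar with static keys, the pattern the diff relies on
boils down to this (condensed from the patch itself; example_check()
mirrors housekeeping_test_cpu()):

	/* include/linux/housekeeping.h, visible to inline helpers: */
	DECLARE_STATIC_KEY_FALSE(housekeeping_overriden);

	/* kernel/housekeeping.c, defaults to false: */
	DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);

	static bool example_check(int cpu)
	{
		/* Near-zero cost until static_branch_enable() flips it. */
		if (static_branch_unlikely(&housekeeping_overriden))
			return cpumask_test_cpu(cpu, housekeeping_mask);
		return true;
	}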

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Mike Galbraith 
Cc: Ingo Molnar 
Cc: Christoph Lameter 
Cc: Paul E. McKenney 
Cc: Wanpeng Li 
Cc: Luiz Capitulino 
---
 include/linux/housekeeping.h |  3 ++-
 kernel/housekeeping.c| 14 +-
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/housekeeping.h b/include/linux/housekeeping.h
index 31a1401..cbe8d63 100644
--- a/include/linux/housekeeping.h
+++ b/include/linux/housekeeping.h
@@ -6,6 +6,7 @@
 #include 
 
 #ifdef CONFIG_NO_HZ_FULL
+DECLARE_STATIC_KEY_FALSE(housekeeping_overriden);
 extern int housekeeping_any_cpu(void);
 extern const struct cpumask *housekeeping_cpumask(void);
 extern void housekeeping_affine(struct task_struct *t);
@@ -31,7 +32,7 @@ static inline void housekeeping_init(void) { }
 static inline bool is_housekeeping_cpu(int cpu)
 {
 #ifdef CONFIG_NO_HZ_FULL
-   if (tick_nohz_full_enabled())
+   if (static_branch_unlikely(&housekeeping_overriden))
return housekeeping_test_cpu(cpu);
 #endif
return true;
diff --git a/kernel/housekeeping.c b/kernel/housekeeping.c
index 0183e75..f8be7e6 100644
--- a/kernel/housekeeping.c
+++ b/kernel/housekeeping.c
@@ -9,12 +9,15 @@
 #include 
 #include 
 #include 
+#include <linux/static_key.h>
 
+DEFINE_STATIC_KEY_FALSE(housekeeping_overriden);
+EXPORT_SYMBOL_GPL(housekeeping_overriden);
 static cpumask_var_t housekeeping_mask;
 
 int housekeeping_any_cpu(void)
 {
-   if (tick_nohz_full_enabled())
+   if (static_branch_unlikely(&housekeeping_overriden))
return cpumask_any_and(housekeeping_mask, cpu_online_mask);
 
return smp_processor_id();
@@ -22,7 +25,7 @@ int housekeeping_any_cpu(void)
 
 const struct cpumask *housekeeping_cpumask(void)
 {
-   if (tick_nohz_full_enabled())
+   if (static_branch_unlikely(&housekeeping_overriden))
return housekeeping_mask;
 
return cpu_possible_mask;
@@ -30,19 +33,18 @@ const struct cpumask *housekeeping_cpumask(void)
 
 void housekeeping_affine(struct task_struct *t)
 {
-   if (tick_nohz_full_enabled())
+   if (static_branch_unlikely(&housekeeping_overriden))
set_cpus_allowed_ptr(t, housekeeping_mask);
 }
 
 bool housekeeping_test_cpu(int cpu)
 {
-   if (tick_nohz_full_enabled())
+   if (static_branch_unlikely(&housekeeping_overriden))
return cpumask_test_cpu(cpu, housekeeping_mask);
 
return true;
 }
 
-
 void __init housekeeping_init(void)
 {
if (!tick_nohz_full_enabled())
@@ -58,6 +60,8 @@ void __init housekeeping_init(void)
cpumask_andnot(housekeeping_mask,
   cpu_possible_mask, tick_nohz_full_mask);
 
+   static_branch_enable(&housekeeping_overriden);
+
/* We need at least one CPU to handle housekeeping work */
WARN_ON_ONCE(cpumask_empty(housekeeping_mask));
 }
-- 
2.7.4





Re: [PATCH v5 2/2] thermal: uniphier: add UniPhier thermal driver

2017-07-21 Thread Masahiro Yamada
2017-07-21 20:21 GMT+09:00 Kunihiko Hayashi :

> +static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev)
> +{
> +   struct regmap *map = tdev->regmap;
> +   u32 val;
> +   int ret;
> +
> +   /* stop PVT */
> +   regmap_write_bits(map, tdev->data->block_base + PVTCTLEN,
> + PVTCTLEN_EN, 0);
> +
> +   /*
> +* set default value if missing calibrated value
> +*
> +* Since SoC has a calibrated value that was set in advance,
> +* TMODCOEF shows non-zero and PVT refers the value internally.
> +*
> +* However, some boards don't have the calibrated value.
> +* In that case, TMODCOEF shows zero and the driver has to set
> +* default value manually.
> +*/
> +   ret = regmap_read(map, tdev->data->map_base + TMODCOEF, &val);
> +   if (ret)
> +   return ret;
> +   if (!val)
> +   regmap_write(map, tdev->data->tmod_setup_addr,
> +   TMODSETUP0_EN | TMODSETUP0_VAL(tdev->tmod_calib0) 
> |
> +   TMODSETUP1_EN | 
> TMODSETUP1_VAL(tdev->tmod_calib1));


This code is strange.

What if TMODCOEF has no calibrated value and
"socionext,tmod-calibration" is not set either?


tdev->data->tmod_setup_addr will be set to zero
and the sensor would not work, right?




> +   /* get tmod-calibration values */
> +   calib = of_get_property(dev->of_node, "socionext,tmod-calibration",
> +   NULL);
> +   if (calib) {
> +   tdev->tmod_calib0 = of_read_number(calib, 1);
> +   tdev->tmod_calib1 = of_read_number(calib + 1, 1);
> +   }


From your DT change (https://patchwork.kernel.org/patch/9826391/),
this property seems a pair of u32 values, like follows:

socionext,tmod-calibration = <0x0f22 0x68ee>;


Why do you need to use of_read_number() to retrieve each u32 value?

See the comment and return value (u64):

-->8-
/* Helper to read a big number; size is in cells (not bytes) */
static inline u64 of_read_number(const __be32 *cell, int size)
--8<-


Also, you are not checking the length of the property.
of_read_number() may over-run the property.



of_property_read_u32_array() will be a better choice.


I'd propose the code like follows:





diff --git a/drivers/thermal/uniphier_thermal.c
b/drivers/thermal/uniphier_thermal.c
index 1a77c5bf6930..eea8a3053584 100644
--- a/drivers/thermal/uniphier_thermal.c
+++ b/drivers/thermal/uniphier_thermal.c
@@ -94,16 +94,16 @@ struct uniphier_tm_soc_data {
 struct uniphier_tm_dev {
struct regmap *regmap;
bool alert_en[ALERT_CH_NUM];
-   u32 tmod_calib0;
-   u32 tmod_calib1;
struct thermal_zone_device *tz_dev;
const struct uniphier_tm_soc_data *data;
 };

-static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev)
+static int uniphier_tm_initialize_sensor(struct uniphier_tm_dev *tdev,
+struct device *dev)
 {
struct regmap *map = tdev->regmap;
u32 val;
+   u32 tmod_val[2];
int ret;

/* stop PVT */
@@ -123,10 +123,22 @@ static int uniphier_tm_initialize_sensor(struct
uniphier_tm_dev *tdev)
	ret = regmap_read(map, tdev->data->map_base + TMODCOEF, &val);
if (ret)
return ret;
-   if (!val)
+   if (!val) {
+   /*
+* No preset value found in the register.
+* Look for the fall-back values in DT.
+*/
+   ret = of_property_read_u32_array(dev->of_node,
+    "socionext,tmod-calibration",
+    tmod_val, ARRAY_SIZE(tmod_val));
+   if (ret) {
+   dev_err(dev, "neither register nor DT has calibrated values.\n");
+   return ret;
+   }
regmap_write(map, tdev->data->tmod_setup_addr,
-   TMODSETUP0_EN | TMODSETUP0_VAL(tdev->tmod_calib0) |
-   TMODSETUP1_EN | TMODSETUP1_VAL(tdev->tmod_calib1));
+TMODSETUP0_EN | TMODSETUP0_VAL(tmod_val[0]) |
+TMODSETUP1_EN | TMODSETUP1_VAL(tmod_val[1]));
+   }

/* select temperature mode */
regmap_write_bits(map, tdev->data->block_base + PVTCTLMODE,
@@ -254,7 +266,6 @@ static int uniphier_tm_probe(struct platform_device *pdev)
struct device_node *parent;
struct uniphier_tm_dev *tdev;
const struct thermal_trip *trips;
-   const __be32 *calib;
int i, ret, irq, ntrips, crit_temp = INT_MAX;

tdev = devm_kzalloc(dev, sizeof(*tdev), GFP_KERNEL);
@@ -269,14 +280,6 @@ static int uniphier_tm_probe(struct platform_device *pdev)
if (irq < 0)
return irq;

-   

Re: [PATCH v6 RESEND] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions

2017-07-21 Thread Baoquan He
On 07/21/17 at 12:37pm, Ingo Molnar wrote:
> 
> * Baoquan He  wrote:
> 
> > +/*
> > + * Returns true if mirror region found (and must have been processed
> > + * for slots adding)
> > + */
> > +static bool process_efi_entries(unsigned long minimum,
> > +   unsigned long image_size)
> 
> Also, please don't break the line in the middle of the prototype.

OK, will change it into one line. I worry it's a little too long since
it's 80 chars wide if not broken.
> 
> > +{
> > +   struct efi_info *e = &boot_params->efi_info;
> > +   bool efi_mirror_found = false;
> > +   struct mem_vector region;
> > +   efi_memory_desc_t *md;
> > +   unsigned long pmap;
> > +   char *signature;
> > +   u32 nr_desc;
> > +   int i;
> > +
> > +   signature = (char *)&boot_params->efi_info.efi_loader_signature;
> 
> This is sloppy too: we already have '&boot_params->efi_info' in 'e', why do 
> you 
> duplicate it again, why not write 'e->efi_loader_signature'??

Right, will change.

> 
> > +   if (strncmp(signature, EFI32_LOADER_SIGNATURE, 4) &&
> > +   strncmp(signature, EFI64_LOADER_SIGNATURE, 4))
> > +   return false;
> > +
> > +#ifdef CONFIG_X86_32
> > +   /* Can't handle data above 4GB at this time */
> > +   if (e->efi_memmap_hi) {
> > +   warn("Memory map is above 4GB, EFI should be disabled.\n");
> > +   return false;
> 
> This kernel warning is pretty passive-aggressive: please explain what the 
> problem 
> is and how it can be resolved.
Maybe it can be like:

warn("Data above 4G can't be handled now in 32bit system, EFI should be 
disabled.\n");
> > +   return false;

> 
> > +   }
> > +   pmap =  e->efi_memmap;
> > +#else
> > +   pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32));
> > +#endif
> > +
> > +   nr_desc = e->efi_memmap_size / e->efi_memdesc_size;
> > +   for (i = 0; i < nr_desc; i++) {
> > +   md = (efi_memory_desc_t *)(pmap + (i * e->efi_memdesc_size));
> 
> This looks unnecessarily obfuscated: why not initialize 'md' to 'pmap' when 
> pmap 
> is calculated and just use md[i]?

There are places where the EFI map is retrieved and used like this, e.g.
in efi_high_alloc() of drivers/firmware/efi/libstub/efi-stub-helper.c.
EFI developers worry that the size of efi_memory_desc_t may not be the
same as e->efi_memdesc_size.
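
For what it's worth, a minimal sketch of the stride-based walk (the
consumer function is hypothetical), which both this patch and the libstub
helper use precisely because sizeof(efi_memory_desc_t) may be smaller
than the descriptor stride reported by the firmware:

	nr_desc = e->efi_memmap_size / e->efi_memdesc_size;
	for (i = 0; i < nr_desc; i++) {
		/* Step by the firmware-reported stride, not sizeof(*md). */
		md = (efi_memory_desc_t *)(pmap + i * e->efi_memdesc_size);
		process_md(md);	/* hypothetical consumer */
	}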

Hi Matt,

Could you help have a look at this?

> 
> > +static inline bool process_efi_entries(unsigned long minimum,
> > +  unsigned long image_size)
> 
> ugly linebreak again ...

The whole line is more than 80 chars. I broke the line and used tab and
space to make it align with the 'unsigned long minimum' above. Don't know
why it becomes messy in the patch. Will check and try again.

Thanks
Baoquan



Re: [PATCH] powerpc/44x/fsp2: correct dtb reg property for /sdhci@020c0000

2017-07-21 Thread Ian Campbell
On Fri, 2017-07-21 at 15:54 +0300, Ivan Mikhaylov wrote:
> Hi Ian,
> > Building the split device-tree tree[0] highlighted that upstream commit
> > 9eec6cb142bd ("powerpc/44x/fsp2: Add device tree for FSP2 board") introduced
> > this warning when building the device tree:
> > 
> > $ make CROSS_COMPILE=powerpc-linux-gnu- ARCH=powerpc fsp2.dtb
> >  CHK scripts/mod/devicetable-offsets.h
> >  DTC arch/powerpc/boot/fsp2.dtb
> > > > arch/powerpc/boot/fsp2.dtb: Warning (reg_format): "reg" property in 
> > > > /sdhci@020c0000 has invalid length (8 bytes) (#address-cells == 2, 
> > > > #size-cells == 1)
> > 
> > This commit adds the second address cell as zeroes to resolve the warning. 
> > Note:
> > I have no access to or information about this platform so this is purely a
> > guess as to the fix. An alternative would be to adjust #address-cells, but
> > whether that is correct or not depends on the platform.
> 
> Yes, this problem exists on this tag but it is already fixed and waiting for
> review by this https://patchwork.kernel.org/patch/9819379/ . You can check it
> if you want; anyway, it will go to the powerpc next branch first.

Great, thanks for the info.






Re: [PATCH v2 6/8] KVM: arm/arm64: vgic: Implement forwarding setting

2017-07-21 Thread Christoffer Dall
On Thu, Jun 15, 2017 at 02:52:38PM +0200, Eric Auger wrote:
> Implements kvm_vgic_[set|unset]_forwarding.
> 
> Handle low-level VGIC programming and consistent irqchip
> programming.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> v1 -> v2:
> - change the parameter names used in the declaration
> - use kvm_vgic_map/unmap_phys_irq and handle their returned value
> ---
>  include/kvm/arm_vgic.h   |  5 +++
>  virt/kvm/arm/vgic/vgic.c | 88 
> 
>  2 files changed, 93 insertions(+)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index cceb31d..5064a57 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -343,4 +343,9 @@ int kvm_send_userspace_msi(struct kvm *kvm, struct 
> kvm_msi *msi);
>   */
>  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm);
>  
> +int kvm_vgic_set_forwarding(struct kvm *kvm, unsigned int host_irq,
> + unsigned int vintid);
> +void kvm_vgic_unset_forwarding(struct kvm *kvm, unsigned int host_irq,
> +unsigned int vintid);
> +
>  #endif /* __KVM_ARM_VGIC_H */
> diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
> index 2e35ac7..9ee3e77 100644
> --- a/virt/kvm/arm/vgic/vgic.c
> +++ b/virt/kvm/arm/vgic/vgic.c
> @@ -781,3 +781,91 @@ bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, 
> unsigned int vintid)
>   return map_is_active;
>  }
>  
> +/**
> + * kvm_vgic_set_forwarding - Set IRQ forwarding
> + *
> + * @kvm: kvm handle
> + * @host_irq: the host linux IRQ
> + * @vintid: the virtual INTID
> + *
> + * This function must be called when the IRQ is not active:
> + * ie. not active at GIC level and not currently under injection
> + * into the guest using the unforwarded mode. The physical IRQ must
> + * be disabled and all vCPUs must have been exited and prevented
> + * from being re-entered.
> + */
> +int kvm_vgic_set_forwarding(struct kvm *kvm, unsigned int host_irq,
> + unsigned int vintid)
> +{
> + struct kvm_vcpu *vcpu;
> + struct vgic_irq *irq;
> + int ret;
> +
> + kvm_debug("%s host_irq=%d vintid=%d\n", __func__, host_irq, vintid);

do you need to check if the vgic is initialized etc. here?

> +
> + if (!vgic_valid_spi(kvm, vintid))
> + return -EINVAL;
> +
> + irq = vgic_get_irq(kvm, NULL, vintid);
> + spin_lock(&irq->irq_lock);
> +
> + if (irq->hw) {
> + ret = -EINVAL;

is this because it's already forwarded?  How about EBUSY or EEXIST
instead then?

> + goto unlock;
> + }
> + vcpu = irq->target_vcpu;
> + if (!vcpu) {
> + ret = -EAGAIN;

what is this case exactly?

> + goto unlock;
> + }
> + 
> + ret = kvm_vgic_map_irq(vcpu, irq, host_irq);
> + if (!ret)
> + irq_set_vcpu_affinity(host_irq, vcpu);

so this is essentially map + set_vcpu_affinity.  Why do we want the GIC
to do this in one go as opposed to leaving it up to the caller?

> +unlock:
> + spin_unlock(&irq->irq_lock);
> + vgic_put_irq(kvm, irq);
> + return ret;
> +}
> +
> +/**
> + * kvm_vgic_unset_forwarding - Unset IRQ forwarding
> + *
> + * @kvm: KVM handle
> + * @host_irq: the host Linux IRQ number
> + * @vintid: virtual INTID
> + *
> + * This function must be called when the host irq is disabled and
> + * all vCPUs have been exited and prevented from being re-entered.
> + */
> +void kvm_vgic_unset_forwarding(struct kvm *kvm,
> +unsigned int host_irq,
> +unsigned int vintid)
> +{
> + struct vgic_irq *irq;
> + bool active;
> +
> + kvm_debug("%s host_irq=%d vintid=%d\n", __func__, host_irq, vintid);

do you need to check if the vgic is initialized etc. here?

> +
> + if (!vgic_valid_spi(kvm, vintid))
> + return;
> +
> + irq = vgic_get_irq(kvm, NULL, vintid);
> + spin_lock(&irq->irq_lock);
> +
> + if (!irq->hw)
> + goto unlock;
> +
> + WARN_ON(irq_get_irqchip_state(host_irq, IRQCHIP_STATE_ACTIVE, &active));
> +
> + if (active)
> + irq_set_irqchip_state(host_irq, IRQCHIP_STATE_ACTIVE, false);
> +
> + kvm_vgic_unmap_irq(irq);
> + irq_set_vcpu_affinity(host_irq, NULL);
> +
> +unlock:
> + spin_unlock(&irq->irq_lock);
> + vgic_put_irq(kvm, irq);
> +}
> +
> -- 
> 2.5.5
> 

Thanks,
-Christoffer



Re: [PATCH 0/3] livepatch: introduce atomic replace

2017-07-21 Thread Miroslav Benes
On Wed, 19 Jul 2017, Jason Baron wrote:

> Hi,
> 
> In testing livepatch, I found that when doing cumulative patches, if a patched
> function is completely reverted by a subsequent patch (back to its original 
> state)
> livepatch does not revert the function to its original state. Specifically, if
> patch A introduces a change to function 1, and patch B reverts the change to
> function 1 and introduces changes to say function 2 and 3 as well, the change
> that patch A introduced to function 1 is still present. This could be addressed
> by first completely removing patch A (disable and then rmmod) and then 
> inserting
> patch B (insmod and enable), but this leaves an unpatched window. In 
> discussing
> this issue with Josh on the kpatch mailing list, he mentioned that we could 
> get
> 'atomic replace working properly', and that is the direction of this patchset:
> https://www.redhat.com/archives/kpatch/2017-June/msg5.html

Hi Jason,

this has been on my TODO list for a long time now, so thanks for working 
on this. We have the same feature in kGraft and we use it heavily (in fact 
we distribute our patches as cumulative and "replace_all" how we call it).

The forward port of the feature from kGraft is unfortunately not 
straightforward. We do not have a concept of klp_target_state there, so we 
can freely let functions be patched or reverted in one go. We cannot do 
the same upstream. At first glance, you used a nop function exactly for this 
case. Nice hack :).
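
To spell the nop idea out for readers following along (a hypothetical
sketch only, not the actual mechanism; upstream would presumably generate
such entries internally rather than have patch authors write them):

	/*
	 * Patch B really patches func2/func3 and carries a nop entry
	 * for func1, which patch A had modified, so the transition can
	 * atomically fall back to the original func1.
	 */
	static struct klp_func funcs[] = {
		{ .old_name = "func2", .new_func = fixed_func2, },
		{ .old_name = "func3", .new_func = fixed_func3, },
		{ .old_name = "func1", /* nop: restore original */ },
		{ }
	};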
 
> Patches:
> 
> 1) livepatch: Add klp_object and klp_func iterators
>   Just a prep patch for the 'atomic revert' feature
> 
> 2) livepatch: add atomic replace
>   Core feature
> 
> 3) livepatch: Add a sysctl livepatch_mode for atomic replace
>   Introduces a knob for enabling atomic replace. I hate knobs and perhaps
>   it's possible to default to cumulative replace? Although I suspect there
>   are workflows relying on the existing behavior - I'm not sure. It may
>   be desirable to associate the knob with the patch itself as in the
>   'immediate' flag, such that we don't introduce a global sysctl that
>   likely would also need to built-in, if there are patches in the initrd.

Yes. I think it should be associated with the patch itself. This would 
allow more flexible behaviour. You could stack more patches on top of 
"atomic replace" patch.

Anyway, I'm on holiday next week, so I'll take a proper look the week 
after.

Thanks,
Miroslav



Re: [PATCH V3 1/3] sched: cpufreq: Allow remote cpufreq callbacks

2017-07-21 Thread Peter Zijlstra
On Thu, Jul 13, 2017 at 12:14:37PM +0530, Viresh Kumar wrote:
> diff --git a/drivers/cpufreq/cpufreq_governor.c 
> b/drivers/cpufreq/cpufreq_governor.c
> index 47e24b5384b3..606b1a37a1af 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -275,6 +275,10 @@ static void dbs_update_util_handler(struct 
> update_util_data *data, u64 time,
>   struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
>   u64 delta_ns, lst;
>  
> + /* Don't allow remote callbacks */
> + if (smp_processor_id() != data->cpu)
> + return;
> +

The alternative is using some of that policy_dbs->policy->*cpus crud I
suppose, because:

>   /*
>* The work may not be allowed to be queued up right now.
>* Possible reasons:
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index b7fb8b7c980d..4bee2f4cbc28 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1732,6 +1732,10 @@ static void intel_pstate_update_util_pid(struct 
> update_util_data *data,
>   struct cpudata *cpu = container_of(data, struct cpudata, update_util);
>   u64 delta_ns = time - cpu->sample.time;
>  
> + /* Don't allow remote callbacks */
> + if (smp_processor_id() != data->cpu)
> + return;
> +
>   if ((s64)delta_ns < pid_params.sample_rate_ns)
>   return;
>  
> @@ -1749,6 +1753,10 @@ static void intel_pstate_update_util(struct 
> update_util_data *data, u64 time,
>   struct cpudata *cpu = container_of(data, struct cpudata, update_util);
>   u64 delta_ns;
>  
> + /* Don't allow remote callbacks */
> + if (smp_processor_id() != data->cpu)
> + return;
> +
>   if (flags & SCHED_CPUFREQ_IOWAIT) {
>   cpu->iowait_boost = int_tofp(1);
>   } else if (cpu->iowait_boost) {


For these we can already use cpu->cpu, which would make:

> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index d2be2ccbb372..8256a8f35f22 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -16,6 +16,7 @@
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {
> void (*func)(struct update_util_data *data, u64 time, unsigned int 
> flags);
> +   unsigned int cpu;
>  };
>  
>  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index dbc51442ecbc..ee4c596b71b4 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -42,6 +42,7 @@ void cpufreq_add_update_util_hook(int cpu, struct 
> update_util_data *data,
>   return;
>  
>   data->func = func;
> + data->cpu = cpu;
>   rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);

redundant.
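
For reference, a rough sketch of the policy->cpus based check alluded to
above (the helper name is made up; assumes the policy pointer is valid
whenever the hook runs):

	static bool dbs_update_is_local(struct policy_dbs_info *policy_dbs)
	{
		/* Accept the callback from any CPU inside this policy. */
		return cpumask_test_cpu(smp_processor_id(),
					policy_dbs->policy->cpus);
	}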

> diff --git a/kernel/sched/cpufreq_schedutil.c 
> b/kernel/sched/cpufreq_schedutil.c
> index 29a397067ffa..ed9c589e5386 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,6 +218,10 @@ static void sugov_update_single(struct update_util_data 
> *hook, u64 time,
>   unsigned int next_f;
>   bool busy;
>  
> + /* Remote callbacks aren't allowed for policies which aren't shared */
> + if (smp_processor_id() != hook->cpu)
> + return;
> +
>   sugov_set_iowait_boost(sg_cpu, time, flags);
>   sg_cpu->last_update = time;
>  
> @@ -290,6 +294,10 @@ static void sugov_update_shared(struct update_util_data 
> *hook, u64 time,
>   unsigned long util, max;
>   unsigned int next_f;
>  
> + /* Don't allow remote callbacks */
> + if (smp_processor_id() != hook->cpu)
> + return;
> +
>   sugov_get_util(&util, &max);
>  
>   raw_spin_lock(&sg_policy->update_lock);


Given the whole rq->lock thing, I suspect we could actually not do these
two. That would then continue to process the iowait and other accounting
stuff, but stall the moment we call into the actual driver, which will
then drop the request on the floor as per the first few hunks.

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a84299f44b5d..7fcfaee39d19 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1136,7 +1136,7 @@ static void update_curr_dl(struct rq *rq)
>   }
>  
>   /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> - cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL);
> + cpufreq_update_util(rq, SCHED_CPUFREQ_DL);
>  
>   schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..d378d02fdfcb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3278,7 +3278,9 @@ static inline void set_tg_cfs_propagate(struct cfs_rq 
> *cfs_rq) {}
>  
>  static inline void 

Re: [PATCH v2 5/8] KVM: arm/arm64: vgic: Handle mapped level sensitive SPIs

2017-07-21 Thread Christoffer Dall
On Fri, Jul 07, 2017 at 09:41:42AM +0200, Auger Eric wrote:
> Hi Marc,
> 
> On 04/07/2017 14:15, Marc Zyngier wrote:
> > Hi Eric,
> > 
> > On 15/06/17 13:52, Eric Auger wrote:
> >> Currently, the line level of unmapped level sensitive SPIs is
> >> toggled down by the maintenance IRQ handler/resamplefd mechanism.
> >>
> >> As mapped SPI completion is not trapped, we cannot rely on this
> >> mechanism and the line level needs to be observed at distributor
> >> level instead.
> >>
> >> This patch handles the physical IRQ case in vgic_validate_injection
> >> and get the line level of a mapped SPI at distributor level.
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >>
> >> v1 -> v2:
> >> - renamed is_unshared_mapped into is_mapped_spi
> >> - changes to kvm_vgic_map_phys_irq moved in the previous patch
> >> - make vgic_validate_injection more readable
> >> - reword the commit message
> >> ---
> >>  virt/kvm/arm/vgic/vgic.c | 16 ++--
> >>  virt/kvm/arm/vgic/vgic.h |  7 ++-
> >>  2 files changed, 20 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
> >> index 075f073..2e35ac7 100644
> >> --- a/virt/kvm/arm/vgic/vgic.c
> >> +++ b/virt/kvm/arm/vgic/vgic.c
> >> @@ -139,6 +139,17 @@ void vgic_put_irq(struct kvm *kvm, struct vgic_irq 
> >> *irq)
> >>kfree(irq);
> >>  }
> >>  
> >> +bool irq_line_level(struct vgic_irq *irq)
> >> +{
> >> +  bool line_level = irq->line_level;
> >> +
> >> +  if (unlikely(is_mapped_spi(irq)))
> >> +  WARN_ON(irq_get_irqchip_state(irq->host_irq,
> >> +IRQCHIP_STATE_PENDING,
> >> +  &line_level));
> >> +  return line_level;
> >> +}
> >> +
> >>  /**
> >>   * kvm_vgic_target_oracle - compute the target vcpu for an irq
> >>   *
> >> @@ -236,13 +247,14 @@ static void vgic_sort_ap_list(struct kvm_vcpu *vcpu)
> >>  
> >>  /*
> >>   * Only valid injection if changing level for level-triggered IRQs or for 
> >> a
> >> - * rising edge.
> >> + * rising edge. Injection of virtual interrupts associated to physical
> >> + * interrupts always is valid.
> >>   */
> >>  static bool vgic_validate_injection(struct vgic_irq *irq, bool level)
> >>  {
> >>switch (irq->config) {
> >>case VGIC_CONFIG_LEVEL:
> >> -  return irq->line_level != level;
> >> +  return (irq->line_level != level || 
> >> unlikely(is_mapped_spi(irq)));
> >>case VGIC_CONFIG_EDGE:
> >>return level;
> >>}
> >> diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
> >> index bba7fa2..da254ae 100644
> >> --- a/virt/kvm/arm/vgic/vgic.h
> >> +++ b/virt/kvm/arm/vgic/vgic.h
> >> @@ -96,14 +96,19 @@
> >>  /* we only support 64 kB translation table page size */
> >>  #define KVM_ITS_L1E_ADDR_MASK GENMASK_ULL(51, 16)
> >>  
> >> +bool irq_line_level(struct vgic_irq *irq);
> >> +
> >>  static inline bool irq_is_pending(struct vgic_irq *irq)
> >>  {
> >>if (irq->config == VGIC_CONFIG_EDGE)
> >>return irq->pending_latch;
> >>else
> >> -  return irq->pending_latch || irq->line_level;
> >> +  return irq->pending_latch || irq_line_level(irq);
> > 
> > I'm a bit concerned that an edge interrupt doesn't take the distributor
> > state into account here. Why is that so? Once an SPI is forwarded to a
> > guest, a large part of the edge vs level differences move into the HW,
> > and are not that different anymore from a SW PoV.
> 
> As pointed out by Christoffer in https://lkml.org/lkml/2017/6/8/322,
> isn't it a bit risky in general to poke the physical state instead of
> the virtual state. For level sensitive, to me we don't really have many
> other alternatives. For edge, we are not obliged to.

I think we need to be clear on the fundamental question of whether or
not we consider pending_latch and/or line_level for mapped interrupts.

I can definitely see the argument that the pending state is kept in
hardware, so if you want to know that for a mapped interrupt, ask the
hardware.

The upside of this approach is a clean separation of state and we avoid
any logic to synchronize a virtual state with the physical state.

The downside is that it's slower to peek into the physical GIC than to
read a variable from memory, and we need to special case the validate
path (which I now understand).

If we move to keeping the state in HW, how do we deal with GICD_SPENDR?
Does that mean we will forward a write from the VM handled by the VGIC to
the physical GIC?

> 
> Don't we have situations, due to the lazy disable approach, where the
> physical IRQ hits, enters the genirq handler and the actual handler is
> not called, ie. the virtual IRQ is not injected?
> 

I'm not sure I remember what these situations were, specifically, but
certainly if we ever have a situation where a mapped irq's pending state
should be different from that of the physical one, then it doesn't work.

Thanks,
-Christoffer



[PATCH 1/2] PM / suspend: Use mem_sleep_labels[] strings in messages

2017-07-21 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

Some messages in suspend.c currently print state names from
pm_states[], but that may be confusing if the mem_sleep sysfs
attribute is changed to anything different from "mem", because
in those cases the messages will say either "freeze" or "standby"
after writing "mem" to /sys/power/state.

To avoid the confusion, use mem_sleep_labels[] strings in those
messages instead.

Signed-off-by: Rafael J. Wysocki 
---
 kernel/power/suspend.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/power/suspend.c
===================================================================
--- linux-pm.orig/kernel/power/suspend.c
+++ linux-pm/kernel/power/suspend.c
@@ -555,7 +555,7 @@ static int enter_state(suspend_state_t s
trace_suspend_resume(TPS("sync_filesystems"), 0, false);
 #endif
 
-   pm_pr_dbg("Preparing system for sleep (%s)\n", pm_states[state]);
+   pm_pr_dbg("Preparing system for sleep (%s)\n", mem_sleep_labels[state]);
pm_suspend_clear_flags();
error = suspend_prepare(state);
if (error)
@@ -565,7 +565,7 @@ static int enter_state(suspend_state_t s
goto Finish;
 
trace_suspend_resume(TPS("suspend_enter"), state, false);
-   pm_pr_dbg("Suspending system (%s)\n", pm_states[state]);
+   pm_pr_dbg("Suspending system (%s)\n", mem_sleep_labels[state]);
pm_restrict_gfp_mask();
error = suspend_devices_and_enter(state);
pm_restore_gfp_mask();
@@ -592,7 +592,7 @@ int pm_suspend(suspend_state_t state)
if (state <= PM_SUSPEND_ON || state >= PM_SUSPEND_MAX)
return -EINVAL;
 
-   pr_info("PM: suspend entry (%s)\n", pm_states[state]);
+   pr_info("PM: suspend entry (%s)\n", mem_sleep_labels[state]);
error = enter_state(state);
if (error) {
suspend_stats.fail++;
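
For illustration, here is a small stand-alone model of the confusion
being fixed. This is not kernel code; the enum and array contents below
merely mirror the semantics described in the changelog, where mem_sleep
has redirected "mem" to suspend-to-idle:

#include <stdio.h>

enum suspend_state {
	PM_SUSPEND_ON,
	PM_SUSPEND_FREEZE,
	PM_SUSPEND_STANDBY,
	PM_SUSPEND_MEM,
};

/* Names accepted by /sys/power/state. */
static const char * const pm_states[] = {
	[PM_SUSPEND_FREEZE]  = "freeze",
	[PM_SUSPEND_STANDBY] = "standby",
	[PM_SUSPEND_MEM]     = "mem",
};

/* Labels for what the selected state actually is. */
static const char * const mem_sleep_labels[] = {
	[PM_SUSPEND_FREEZE]  = "s2idle",
	[PM_SUSPEND_STANDBY] = "shallow",
	[PM_SUSPEND_MEM]     = "deep",
};

int main(void)
{
	/*
	 * The user writes "mem" to /sys/power/state, but mem_sleep has
	 * been set so that "mem" resolves to suspend-to-idle.
	 */
	enum suspend_state state = PM_SUSPEND_FREEZE;

	printf("before: suspend entry (%s)\n", pm_states[state]);
	printf("after:  suspend entry (%s)\n", mem_sleep_labels[state]);
	return 0;
}

The "before" line reports a "freeze" entry even though the user asked
for "mem", which is exactly the confusion the changelog describes; the
"after" line names the sleep variant that will actually be entered.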



[PATCH 0/2] PM / suspend: Messaging updates

2017-07-21 Thread Rafael J. Wysocki
Hi,

The first patch addresses a potential confusion regarding messages printed by
the suspend core code to the kernel log and the second one adds a pr_fmt() to
kernel/power/suspend.c.

The patches are on top of the series I posted yesterday:

http://marc.info/?l=linux-pm&m=150059650805506&w=2

Thanks,
Rafael


[PATCH 2/2] PM / suspend: Define pr_fmt() in suspend.c

2017-07-21 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

Define a common prefix ("PM:") for messages printed by the
code in kernel/power/suspend.c.

Signed-off-by: Rafael J. Wysocki 
---
 kernel/power/suspend.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-pm/kernel/power/suspend.c
===================================================================
--- linux-pm.orig/kernel/power/suspend.c
+++ linux-pm/kernel/power/suspend.c
@@ -8,6 +8,8 @@
  * This file is released under the GPLv2.
  */
 
+#define pr_fmt(fmt) "PM: " fmt
+
 #include 
 #include 
 #include 
@@ -389,7 +391,7 @@ static int suspend_enter(suspend_state_t
 
error = dpm_suspend_late(PMSG_SUSPEND);
if (error) {
-   pr_err("PM: late suspend of devices failed\n");
+   pr_err("late suspend of devices failed\n");
goto Platform_finish;
}
error = platform_suspend_prepare_late(state);
@@ -403,7 +405,7 @@ static int suspend_enter(suspend_state_t
 
error = dpm_suspend_noirq(PMSG_SUSPEND);
if (error) {
-   pr_err("PM: noirq suspend of devices failed\n");
+   pr_err("noirq suspend of devices failed\n");
goto Platform_early_resume;
}
error = platform_suspend_prepare_noirq(state);
@@ -477,7 +479,7 @@ int suspend_devices_and_enter(suspend_st
suspend_test_start();
error = dpm_suspend_start(PMSG_SUSPEND);
if (error) {
-   pr_err("PM: Some devices failed to suspend, or early wake event 
detected\n");
+   pr_err("Some devices failed to suspend, or early wake event 
detected\n");
goto Recover_platform;
}
suspend_test_finish("suspend devices");
@@ -534,7 +536,7 @@ static int enter_state(suspend_state_t s
if (state == PM_SUSPEND_FREEZE) {
 #ifdef CONFIG_PM_DEBUG
if (pm_test_level != TEST_NONE && pm_test_level <= TEST_CPUS) {
-   pr_warn("PM: Unsupported test mode for suspend to idle, 
please choose none/freezer/devices/platform.\n");
+   pr_warn("Unsupported test mode for suspend to idle, 
please choose none/freezer/devices/platform.\n");
return -EAGAIN;
}
 #endif
@@ -549,7 +551,7 @@ static int enter_state(suspend_state_t s
 
 #ifndef CONFIG_SUSPEND_SKIP_SYNC
trace_suspend_resume(TPS("sync_filesystems"), 0, true);
-   pr_info("PM: Syncing filesystems ... ");
+   pr_info("Syncing filesystems ... ");
sys_sync();
pr_cont("done.\n");
trace_suspend_resume(TPS("sync_filesystems"), 0, false);
@@ -592,7 +594,7 @@ int pm_suspend(suspend_state_t state)
if (state <= PM_SUSPEND_ON || state >= PM_SUSPEND_MAX)
return -EINVAL;
 
-   pr_info("PM: suspend entry (%s)\n", mem_sleep_labels[state]);
+   pr_info("suspend entry (%s)\n", mem_sleep_labels[state]);
error = enter_state(state);
if (error) {
suspend_stats.fail++;
@@ -600,7 +602,7 @@ int pm_suspend(suspend_state_t state)
} else {
suspend_stats.success++;
}
-   pr_info("PM: suspend exit\n");
+   pr_info("suspend exit\n");
return error;
 }
 EXPORT_SYMBOL(pm_suspend);
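
The mechanism this relies on: the kernel's pr_*() macros expand
pr_fmt(fmt) at each call site (pr_info(fmt, ...) becomes
printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)), so defining pr_fmt()
at the top of the file prefixes every message in it. A stand-alone
sketch with simplified stand-ins for the printk machinery:

#include <stdio.h>

/* As in the kernel, pr_fmt() must be defined before the pr_*() macros
 * are; in suspend.c that is why the #define precedes the includes. */
#define pr_fmt(fmt) "PM: " fmt

/* Simplified stand-ins for the kernel's pr_info()/pr_err(). */
#define pr_info(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)
#define pr_err(fmt, ...)  fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

int main(void)
{
	pr_info("suspend entry (%s)\n", "deep");    /* "PM: suspend entry (deep)" */
	pr_err("late suspend of devices failed\n"); /* "PM: late suspend of devices failed" */
	return 0;
}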



Re: [PATCH] powerpc/44x/fsp2: correct dtb reg property for /sdhci@020c0000

2017-07-21 Thread Ivan Mikhaylov
Hi Ian,
>Building the split device-tree tree[0] highlighted that upstream commit
>9eec6cb142bd ("powerpc/44x/fsp2: Add device tree for FSP2 board") introduced
>this warning when building the device tree:
>
>$ make CROSS_COMPILE=powerpc-linux-gnu- ARCH=powerpc fsp2.dtb
>  CHK scripts/mod/devicetable-offsets.h
>  DTC arch/powerpc/boot/fsp2.dtb
>arch/powerpc/boot/fsp2.dtb: Warning (reg_format): "reg" property in /sdhci@020c0000 has invalid length (8 bytes) (#address-cells == 2, #size-cells == 1)
>
>This commit adds the second address cell as zeroes to resolve the warning. Note:
>I have no access to or information about this platform so this is purely a
>guess as to the fix. An alternative would be to adjust #address-cells, but
>whether that is correct or not depends on the platform.

Yes, this problem exists at this tag, but it is already fixed and waiting for
review here: https://patchwork.kernel.org/patch/9819379/ . You can check it
if you want; in any case, it will go to the powerpc next branch first.

Thank you.
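
For reference, the arithmetic behind the warning: with #address-cells
== 2 and #size-cells == 1, each "reg" entry must be 2 + 1 = 3 cells
(12 bytes), while the node supplied only 2 cells (8 bytes). The guessed
fix described above would therefore look roughly like the following
device-tree fragment (the size value here is a placeholder for
illustration, not taken from the real FSP2 dts):

	sdhci@020c0000 {
		/* high address word added as zero; 0x100 is made up */
		reg = <0x0 0x020c0000 0x100>;
	};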


