[PATCH v2 3/5] PCI/AER: Add sysfs attributes to provide breakdown of AERs

2018-05-23 Thread Rajat Jain
Add sysfs attributes to provide a breakdown of the AERs seen,
by type of correctable or uncorrectable error:

dev_breakdown_correctable
dev_breakdown_uncorrectable

Signed-off-by: Rajat Jain 
---
v2: Use tabs instead of spaces, fix the subject, and print
all non-zero counters.

 drivers/pci/pcie/aer/aerdrv.h          |  6 ++++++
 drivers/pci/pcie/aer/aerdrv_errprint.c |  6 ++++--
 drivers/pci/pcie/aer/aerdrv_stats.c    | 28 ++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index b5d5ad6f2c03..048fbd7c9633 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -89,6 +89,12 @@ int pci_aer_stats_init(struct pci_dev *pdev);
 void pci_aer_stats_exit(struct pci_dev *pdev);
 void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
 
+extern const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS];
+
+extern const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS];
+
 #ifdef CONFIG_ACPI_APEI
 int pcie_aer_get_firmware_first(struct pci_dev *pci_dev);
 #else
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 5e8b98deda08..5585f309f1a8 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -68,7 +68,8 @@ static const char *aer_error_layer[] = {
"Transaction Layer"
 };
 
-static const char *aer_correctable_error_string[] = {
+const char
+*aer_correctable_error_string[AER_MAX_TYPEOF_CORRECTABLE_ERRS] = {
"Receiver Error",   /* Bit Position 0   */
NULL,
NULL,
@@ -87,7 +88,8 @@ static const char *aer_correctable_error_string[] = {
"Header Log Overflow",  /* Bit Position 15  */
 };
 
-static const char *aer_uncorrectable_error_string[] = {
+const char
+*aer_uncorrectable_error_string[AER_MAX_TYPEOF_UNCORRECTABLE_ERRS] = {
"Undefined",/* Bit Position 0   */
NULL,
NULL,
diff --git a/drivers/pci/pcie/aer/aerdrv_stats.c b/drivers/pci/pcie/aer/aerdrv_stats.c
index beffef2b..e47321b267f6 100644
--- a/drivers/pci/pcie/aer/aerdrv_stats.c
+++ b/drivers/pci/pcie/aer/aerdrv_stats.c
@@ -58,10 +58,38 @@ aer_stats_aggregate_attr(dev_total_cor_errs);
 aer_stats_aggregate_attr(dev_total_fatal_errs);
 aer_stats_aggregate_attr(dev_total_nonfatal_errs);
 
+#define aer_stats_breakdown_attr(field, stats_array, strings_array)	\
+	static ssize_t							\
+	field##_show(struct device *dev, struct device_attribute *attr,	\
+		     char *buf)						\
+{									\
+	unsigned int i;							\
+	char *str = buf;						\
+	struct pci_dev *pdev = to_pci_dev(dev);				\
+	u64 *stats = pdev->aer_stats->stats_array;			\
+	for (i = 0; i < ARRAY_SIZE(strings_array); i++) {		\
+		if (strings_array[i])					\
+			str += sprintf(str, "%s = 0x%llx\n",		\
+				       strings_array[i], stats[i]);	\
+		else if (stats[i])					\
+			str += sprintf(str, #stats_array "bit[%d] = 0x%llx\n",\
+				       i, stats[i]);			\
+	}								\
+	return str - buf;						\
+}									\
+static DEVICE_ATTR_RO(field)
+
+aer_stats_breakdown_attr(dev_breakdown_correctable, dev_cor_errs,
+aer_correctable_error_string);
+aer_stats_breakdown_attr(dev_breakdown_uncorrectable, dev_uncor_errs,
+aer_uncorrectable_error_string);
+
 static struct attribute *aer_stats_attrs[] __ro_after_init = {
 	&dev_attr_dev_total_cor_errs.attr,
 	&dev_attr_dev_total_fatal_errs.attr,
 	&dev_attr_dev_total_nonfatal_errs.attr,
+	&dev_attr_dev_breakdown_correctable.attr,
+	&dev_attr_dev_breakdown_uncorrectable.attr,
 	NULL
 };
 
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH] powerpc/xmon: really enable xmon when a breakpoint is set

2018-05-23 Thread Michal Suchánek
On Tue, 22 May 2018 12:53:53 +0530
Vaibhav Jain  wrote:

> Thanks for the patch Michal,
> 
> Michal Suchanek  writes:
> 
> > When single-stepping kernel code from xmon without a debug hook
> > enabled the kernel crashes. This can happen when kernel starts with
> > xmon on crash disabled but xmon is entered using sysrq.
> >
> > Commit e1368d0c9edb ("powerpc/xmon: Setup debugger hooks when first
> > break-point is set") adds force_enable_xmon function that prints
> > "xmon: Enabling debugger hooks" but does not enable them.  
> Debugger hooks are enabled just before debugger() is entered from
> sysrq_handle_xmon(). That's why force_enable_xmon() simply sets
> 'xmon_on=1' and exits.
> 
> The problem you are seeing is probably due to sysrq_handle_xmon()
> clearing the debugger hooks on return from debugger() as xmon_on was
> never set for the 's' xmon command.

Indeed, setting xmon_on is sufficient. Will resend the patch.

Thanks

Michal

> 
> > Add the call to xmon_init to install the debugger hooks in
> > force_enable_xmon and also call force_enable_xmon when
> > single-stepping in xmon.  
> Only calling force_enable_xmon() from do_step() should suffice, as
> on exit from debugger() the value of xmon_on is checked and, if
> required, the debugger hooks are kept instead of getting cleared.
> 
> 
> >  arch/powerpc/xmon/xmon.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> > index a0842f1ff72c..504bd1c3d8b0 100644
> > --- a/arch/powerpc/xmon/xmon.c
> > +++ b/arch/powerpc/xmon/xmon.c  
> 
> 
> > @@ -1275,6 +1279,7 @@ static inline void force_enable_xmon(void)
> > if (!xmon_on) {
> > printf("xmon: Enabling debugger hooks\n");
> > xmon_on = 1;
> > +   xmon_init(1);
> > }
> >  }  
> 
> As mentioned above call to force_enable_xmon() is usually done in
> context of sysrq_handle_xmon() which sets the debugger hooks as soon
> as its entered. So I think that this hunk is not really needed.
> 
> 



[PATCH v2 5/5] Documentation/ABI: Add details of PCI AER statistics

2018-05-23 Thread Rajat Jain
Add the PCI AER statistics details to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
and provide a pointer to it in
Documentation/PCI/pcieaer-howto.txt

Signed-off-by: Rajat Jain 
---
v2: Move the documentation to Documentation/ABI/

 .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++++++++++
 Documentation/PCI/pcieaer-howto.txt           |   5 +
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
new file mode 100644
index ..f55c389290ac
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -0,0 +1,103 @@
+==========================
+PCIe Device AER statistics
+==========================
+These attributes show up under all the devices that are AER capable. These
+statistical counters indicate the errors "as seen/reported by the device".
+Note that if an end point is causing problems, the AER counters may
+increment at its link partner (e.g. the root port), because the errors are
+"seen"/reported by the link partner and not by the problematic end point
+itself (which may report all counters as 0, as it never saw any problems).
+
+Where: /sys/bus/pci/devices//aer_stats/dev_total_cor_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of correctable errors seen and reported by this
+   PCI device using ERR_COR.
+
+Where: /sys/bus/pci/devices//aer_stats/dev_total_fatal_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of uncorrectable fatal errors seen and reported
+   by this PCI device using ERR_FATAL.
+
+Where: /sys/bus/pci/devices//aer_stats/dev_total_nonfatal_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of uncorrectable non-fatal errors seen and reported
+   by this PCI device using ERR_NONFATAL.
+
+Where: /sys/bus/pci/devices//aer_stats/dev_breakdown_correctable
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Breakdown of correctable errors seen and reported by this
+   PCI device using ERR_COR. A sample result looks like this:
+-
+Receiver Error = 0x174
+Bad TLP = 0x19
+Bad DLLP = 0x3
+RELAY_NUM Rollover = 0x0
+Replay Timer Timeout = 0x1
+Advisory Non-Fatal = 0x0
+Corrected Internal Error = 0x0
+Header Log Overflow = 0x0
+-
+
+Where: /sys/bus/pci/devices//aer_stats/dev_breakdown_uncorrectable
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Breakdown of uncorrectable errors seen and reported by this
+   PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
+   looks like this:
+-
+Undefined = 0x0
+Data Link Protocol = 0x0
+Surprise Down Error = 0x0
+Poisoned TLP = 0x0
+Flow Control Protocol = 0x0
+Completion Timeout = 0x0
+Completer Abort = 0x0
+Unexpected Completion = 0x0
+Receiver Overflow = 0x0
+Malformed TLP = 0x0
+ECRC = 0x0
+Unsupported Request = 0x0
+ACS Violation = 0x0
+Uncorrectable Internal Error = 0x0
+MC Blocked TLP = 0x0
+AtomicOp Egress Blocked = 0x0
+TLP Prefix Blocked Error = 0x0
+-
+
+
+PCIe Rootport AER statistics
+============================
+These attributes show up only under the root ports that are AER capable.
+These indicate the number of error messages as "reported to" the root port.
+Please note that the root ports also transmit (internally) the ERR_*
+messages for errors seen by the internal root port PCI device, so these
+counters include them and are thus cumulative of all the error messages on
+the PCI hierarchy originating at that root port.
+
+Where: /sys/bus/pci/devices//aer_stats/rootport_total_cor_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of ERR_COR messages reported to rootport.
+
+Where: /sys/bus/pci/devices//aer_stats/rootport_total_fatal_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of ERR_FATAL messages reported to rootport.
+
+Where: /sys/bus/pci/devices//aer_stats/rootport_total_nonfatal_errs
+Date:  May 2018
+Kernel Version: 4.17.0
+Contact:   linux-...@vger.kernel.org, raja...@google.com
+Description:   Total number of ERR_NONFATAL messages reported to rootport.

Re: [PATCH] tpm: fix race condition in tpm_common_write()

2018-05-23 Thread Tadeusz Struk
On 05/23/2018 06:23 AM, Jarkko Sakkinen wrote:
> Ouch o_O Do you have a fixes tag for this one?
> 

This one is quite tricky.
The original bug was introduced by abce9ac292e13 ("tpm: Propagate error
from tpm_transmit to fix a timeout hang"), and the code back then was in
the drivers/char/tpm/tpm-interface.c file.

Then there were two other commits that moved the code around:
afdba32e2a9ea ("tpm: Pull everything related to /dev/tpmX into tpm-dev.c"),
which moved it from drivers/char/tpm/tpm-interface.c into
drivers/char/tpm/tpm-dev.c,

and the last one, ecb38e2f521b0 ("tpm: split out tpm-dev.c into tpm-dev.c
and tpm-common-dev.c"), which moved it from drivers/char/tpm/tpm-dev.c
into tpm-common-dev.c.

I have no idea how to tag it. Maybe we can use:
Fixes: ecb38e2f521b0 ("tpm: split out tpm-dev.c into tpm-dev.c and
tpm-common-dev.c")

And then it probably needs to be backported manually all the way back to
abce9ac292e13.

Thanks,
-- 
Tadeusz



[PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

2018-05-23 Thread Tejun Heo
From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Wed, 23 May 2018 10:29:00 -0700

cgwb_release() punts the actual release to cgwb_release_workfn() on
system_wq.  Depending on the number of cgroups or block devices, there
can be a lot of cgwb_release_workfn() in flight at the same time.

We're periodically seeing close to 256 kworkers getting stuck with the
following stack trace, and over time the entire system gets stuck.

  [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
  [] synchronize_rcu_expedited+0x24/0x30
  [] bdi_unregister+0x53/0x290
  [] release_bdi+0x89/0xc0
  [] wb_exit+0x85/0xa0
  [] cgwb_release_workfn+0x54/0xb0
  [] process_one_work+0x150/0x410
  [] worker_thread+0x6d/0x520
  [] kthread+0x12c/0x160
  [] ret_from_fork+0x29/0x40
  [] 0x

The events leading to the lockup are...

1. A lot of cgwb_release_workfn() is queued at the same time and all
   system_wq kworkers are assigned to execute them.

2. They all end up calling synchronize_rcu_expedited().  One of them
   wins and tries to perform the expedited synchronization.

3. However, that involves queueing rcu_exp_work to system_wq and
   waiting for it.  Because #1 is holding all available kworkers on
   system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
   is waiting for synchronize_rcu_expedited() which in turn is waiting
   for cgwb_release_workfn() to free up some of the kworkers.

We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
same time.  There's nothing to be gained from that.  This patch
updates cgwb release path to use a dedicated percpu workqueue with
@max_active of 1.

While this resolves the problem at hand, it might be a good idea to
isolate rcu_exp_work to its own workqueue too as it can be used from
various paths and is prone to this sort of indirect A-A deadlocks.

Signed-off-by: Tejun Heo 
Cc: "Paul E. McKenney" 
Cc: sta...@vger.kernel.org
---
 mm/backing-dev.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..8fe3ebd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -412,6 +412,7 @@ static void wb_exit(struct bdi_writeback *wb)
  * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
+static struct workqueue_struct *cgwb_release_wq;
 
 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -522,7 +523,7 @@ static void cgwb_release(struct percpu_ref *refcnt)
 {
 	struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
 						refcnt);
-	schedule_work(&wb->release_work);
+	queue_work(cgwb_release_wq, &wb->release_work);
 }
 
 static void cgwb_kill(struct bdi_writeback *wb)
@@ -784,6 +785,21 @@ static void cgwb_bdi_register(struct backing_dev_info *bdi)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+static int __init cgwb_init(void)
+{
+   /*
+* There can be many concurrent release work items overwhelming
+* system_wq.  Put them in a separate wq and limit concurrency.
+* There's no point in executing many of these in parallel.
+*/
+   cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+   if (!cgwb_release_wq)
+   return -ENOMEM;
+
+   return 0;
+}
+subsys_initcall(cgwb_init);
+
 #else  /* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
-- 
2.9.5



Re: B53 DSA switch problem on Banana Pi-R1 on Fedora 26

2018-05-23 Thread Florian Fainelli
On 05/23/2018 10:35 AM, Gerhard Wiesinger wrote:
> On 23.05.2018 17:28, Florian Fainelli wrote:
>>
>>> And in the future (time plan)?
>> If you don't care about multicast then you can use those patches:
>>
>> https://github.com/ffainelli/linux/commit/de055bf5f34e9806463ab2793e0852f5dfc380df
>>
>>
>> and you have to change the part of drivers/net/dsa/b53/b53_common.c that
>> returns DSA_TAG_PROTO_NONE for 53125:
>>
>>
>> diff --git a/drivers/net/dsa/b53/b53_common.c
>> b/drivers/net/dsa/b53/b53_common.c
>> index 9f561fe505cb..3c64f026a8ce 100644
>> --- a/drivers/net/dsa/b53/b53_common.c
>> +++ b/drivers/net/dsa/b53/b53_common.c
>> @@ -1557,7 +1557,7 @@ enum dsa_tag_protocol b53_get_tag_protocol(struct
>> dsa_switch *ds, int port)
>>   * mode to be turned on which means we need to specifically
>> manage ARL
>>   * misses on multicast addresses (TBD).
>>   */
>> -   if (is5325(dev) || is5365(dev) || is539x(dev) || is531x5(dev) ||
>> +   if (is5325(dev) || is5365(dev) || is539x(dev) ||
>>  !b53_can_enable_brcm_tags(ds, port))
>>  return DSA_TAG_PROTO_NONE;
>>
>>
>> That would bring Broadcom tags to the 53125 switch and you would be able
>> to use the configuration lines from Andrew in that case.
> 
> What's the plan here regarding these 2 config option mode (how do you
> call them?)?

Broadcom tags is the underlying feature that provides per-port
information about the packets going in and out. Turning on Broadcom tags
requires turning on managed mode which means that the host now has to
manage how MAC addresses are programmed into the switch, it's not rocket
science, but I don't have a good test framework to automate the testing
of those changes yet. If you are willing to help in the testing, I can
certainly give you patches to try.

> 
> I mean, will this be a breaking change in the future where config has to
> be done in a different way then?

When Broadcom tags are enabled the switch gets usable the way Andrew
described. The only difference it makes to your configuration, if you
want e.g. VLAN 101 to be for ports 1-4 and VLAN 102 to be for port 5,
is that you no longer create an eth0.101 and eth0.102, but instead
create br0.101 and br0.102.

> 
> Or will it be configurable via module parameters or /proc or /sys
> filesystem options?

We might be able to expose a sysfs attribute which shows the type of
tagging enabled by a particular switch; that way scripts can detect
which variant is required: configuring the host controller or the
bridge. Would that be acceptable?
-- 
Florian


Re: semantics of rhashtable and sysvipc

2018-05-23 Thread Davidlohr Bueso

On Wed, 23 May 2018, Davidlohr Bueso wrote:

I see two possible fixes.


I guess a third option would be to make the hashtable static, but I'm not
against using rhashtables so I'm not really considering this.



Re: [PATCH 3/8] md: raid5: use refcount_t for reference counting instead atomic_t

2018-05-23 Thread Peter Zijlstra
On Wed, May 23, 2018 at 06:21:19AM -0700, Matthew Wilcox wrote:
> On Wed, May 09, 2018 at 09:36:40PM +0200, Sebastian Andrzej Siewior wrote:
> > refcount_t type and corresponding API should be used instead of atomic_t 
> > when
> > the variable is used as a reference counter. This allows to avoid accidental
> > refcounter overflows that might lead to use-after-free situations.
> > 
> > Most changes are 1:1 replacements except for
> > BUG_ON(atomic_inc_return(>count) != 1);
> > 
> > which has been turned into
> > refcount_inc(>count);
> > BUG_ON(refcount_read(>count) != 1);
> 
> @@ -5387,7 +5387,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> sh->group = NULL;
> }
> list_del_init(>lru);
> -   BUG_ON(atomic_inc_return(&sh->count) != 1);
> +   refcount_inc(&sh->count);
> +   BUG_ON(refcount_read(&sh->count) != 1);
> return sh;
>  }
> 
> 
> That's the only problematic usage.  And I think what it's really saying is:
> 
>   BUG_ON(refcount_read(&sh->count) != 0);
>   refcount_set(&sh->count, 1);
> 
> With that, this looks like a reasonable use of refcount_t to me.

I'm not so sure, look at:

  r5c_do_reclaim():

	if (!list_empty(&sh->lru) &&
	    !test_bit(STRIPE_HANDLE, &sh->state) &&
	    atomic_read(&sh->count) == 0) {
		r5c_flush_stripe(conf, sh);

Which does:

  r5c_flush_stripe():

	atomic_inc(&sh->count);

Which is another inc-from-zero. Also, having sh's with count==0 on a
list runs counter to the concept of refcounts and smells like usage
counts to me. For refcounts, 0 really means dead and gone.

If this really is supposed to be a refcount, someone more familiar with
the raid5 code should do the patch and write a comprehensive changelog
for it.


Re: [PATCH 3/8] md: raid5: use refcount_t for reference counting instead atomic_t

2018-05-23 Thread Peter Zijlstra
On Wed, May 23, 2018 at 06:21:19AM -0700, Matthew Wilcox wrote:
> On Wed, May 09, 2018 at 09:36:40PM +0200, Sebastian Andrzej Siewior wrote:
> > refcount_t type and corresponding API should be used instead of atomic_t 
> > when
> > the variable is used as a reference counter. This allows to avoid accidental
> > refcounter overflows that might lead to use-after-free situations.
> > 
> > Most changes are 1:1 replacements except for
> > BUG_ON(atomic_inc_return(>count) != 1);
> > 
> > which has been turned into
> > refcount_inc(>count);
> > BUG_ON(refcount_read(>count) != 1);
> 
> @@ -5387,7 +5387,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> sh->group = NULL;
> }
> list_del_init(&sh->lru);
> -   BUG_ON(atomic_inc_return(&sh->count) != 1);
> +   refcount_inc(&sh->count);
> + BUG_ON(refcount_read(&sh->count) != 1);
> return sh;
>  }
> 
> 
> That's the only problematic usage.  And I think what it's really saying is:
> 
>   BUG_ON(refcount_read(&sh->count) != 0);
>   refcount_set(&sh->count, 1);
> 
> With that, this looks like a reasonable use of refcount_t to me.

I'm not so sure, look at:

  r5c_do_reclaim():

if (!list_empty(&sh->lru) &&
!test_bit(STRIPE_HANDLE, &sh->state) &&
atomic_read(&sh->count) == 0) {
	r5c_flush_stripe(conf, sh)

Which does:

  r5c_flush_stripe():

atomic_inc(&sh->count);

Which is another inc-from-zero. Also, having sh's with count==0 in a
list is counter to the concept of refcounts and smells like usage-counts
to me. For refcounts, 0 really means dead and gone.

If this really is supposed to be a refcount, someone more familiar with
the raid5 code should do the patch and write a comprehensive changelog for it.


Re: B53 DSA switch problem on Banana Pi-R1 on Fedora 26

2018-05-23 Thread Florian Fainelli
On 05/23/2018 10:29 AM, Gerhard Wiesinger wrote:
> On 23.05.2018 17:50, Florian Fainelli wrote:
>>
>> On 05/23/2018 08:28 AM, Florian Fainelli wrote:
>>>
>>> On 05/22/2018 09:49 PM, Gerhard Wiesinger wrote:
 On 22.05.2018 22:42, Florian Fainelli wrote:
> On 05/22/2018 01:16 PM, Andrew Lunn wrote:
>>> Planned network structure will be as with 4.7.x kernels:
>>>
>>> br0 <=> eth0.101 <=> eth0 (vlan 101 tagged) <=> lan 1-lan4 (vlan 101
>>> untagged pvid)
>>>
>>> br1 <=> eth0.102 <=> eth0 (vlan 102 tagged) <=> wan (vlan 102
>>> untagged pvid)
>> Do you even need these vlans?
> Yes, remember, b53 does not currently turn on Broadcom tags, so the
> only
> way to segregate traffic is to have VLANs for that.
>
>> Are you doing this for port separation? To keep lan1-4 traffic
>> separate from wan? DSA does that by default, no vlan needed.
>>
>> So you can just do
>>
>> ip link add name br0 type bridge
>> ip link set dev br0 up
>> ip link set dev lan1 master br0
>> ip link set dev lan2 master br0
>> ip link set dev lan3 master br0
>> ip link set dev lan4 master br0
>>
>> and use interface wan directly, no bridge needed.
> That would work once Broadcom tags are turned on which requires
> turning
> on managed mode, which requires work that I have not been able to get
> done :)
 Setup with swconfig:

 #!/usr/bin/bash


 INTERFACE=eth0

 # Delete all IP addresses and get link up
 ip addr flush dev ${INTERFACE}
 ip link set ${INTERFACE} up

 # Lamobo R1 aka BPi R1 Routerboard
 #
 # Speaker | LAN1 | LAN2 | LAN3 | LAN4 || LAN5 | HDMI
 # SW-Port |  P2  |  P1  |  P0  |  P4  ||  P3  |
 # VLAN    |  11  |  12  |  13  |  14  ||ALL(t)|
 #
 # Switch-Port P8 - ALL(t) boards internal CPU Port

 # Setup switch
 swconfig dev ${INTERFACE} set reset 1
 swconfig dev ${INTERFACE} set enable_vlan 1
 swconfig dev ${INTERFACE} vlan 101 set ports '3 8t'
 swconfig dev ${INTERFACE} vlan 102 set ports '4 0 1 2 8t'
 swconfig dev ${INTERFACE} set apply 1

 How to achieve this setup CURRENTLY with DSA?
>>> Your first email had the right programming sequence, but you did not
>>> answer whether you have CONFIG_BRIDGE_VLAN_FILTERING enabled or not,
>>> which is likely your problem.
>> Here are some reference configurations that should work:
>>
>> https://github.com/armbian/build/issues/511#issuecomment-320473246
> 
> I know, some comments are from me but none of them worked, therefore on
> LKML :-)

I see; maybe you could have started there, that would have saved me a
trip to GitHub to find the thread.

> 
> /boot/config-4.16.7-100.fc26.armv7hl:CONFIG_BRIDGE_VLAN_FILTERING=y
> 
> so this can't be the issue, any further ideas?

Yes, remove the "self" from your bridge vlan commands; I don't see that
being necessary.

> 
> On my 2nd Banana Pi-R1 still on Fedora 25 with kernel
> 4.12.8-200.fc25.armv7hl the commands still work well, but I wanted to
> test the upgrade on another one.
> 
> /boot/config-4.12.8-200.fc25.armv7hl:CONFIG_BRIDGE_VLAN_FILTERING=y

Is using an upstream or self-compiled kernel an option at all? I
have no clue what is in a distribution kernel.
-- 
Florian
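For reference, the swconfig script quoted above could be approximated under DSA with a single VLAN-aware bridge. This is a hedged sketch, not a tested recipe for the 53125: it assumes the DSA ports are exposed as lan1-lan4 and wan, that CONFIG_BRIDGE_VLAN_FILTERING=y, and (per Florian's comment) it omits "self" from the per-port commands.

```shell
#!/usr/bin/bash
# Sketch only: DSA + VLAN-filtering equivalent of the swconfig setup.
# Port names lan1-lan4/wan are assumptions about this board's DTS.

ip link add name br0 type bridge vlan_filtering 1
ip link set dev br0 up

for port in lan1 lan2 lan3 lan4; do
        ip link set dev "$port" master br0
        bridge vlan del vid 1 dev "$port"
        # VLAN 101 untagged with PVID on the LAN ports.
        bridge vlan add vid 101 dev "$port" pvid untagged
        ip link set dev "$port" up
done

ip link set dev wan master br0
bridge vlan del vid 1 dev wan
# VLAN 102 untagged with PVID on the WAN port keeps it segregated.
bridge vlan add vid 102 dev wan pvid untagged
ip link set dev wan up
```

As the thread notes, once Broadcom tags are enabled on this switch the VLANs become unnecessary and the plain "master br0" commands from Andrew suffice.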



Re: [PATCH] arm64: kvm: use -fno-jump-tables with clang

2018-05-23 Thread Nick Desaulniers
On Wed, May 23, 2018 at 4:54 AM Andrey Konovalov 
wrote:
> On Tue, May 22, 2018 at 8:28 PM, Nick Desaulniers
>  wrote:
> > On Fri, May 18, 2018 at 11:13 AM Marc Zyngier 
wrote:
> >> > - you have checked that with a released version of the compiler, you
> >
> > On Tue, May 22, 2018 at 10:58 AM Andrey Konovalov wrote:
> >> Tested-by: Andrey Konovalov 
> >
> > Hi Andrey,
> > Thank you very much for this report.  Can you confirm as well the
version
> > of Clang that you were using?

> I'm on 86852a40 ("[InstCombine] Calloc-ed strings optimizations").

> > If it's not a binary release (built from
> > source), would you be able to re-confirm with a released version?

> Sure. Which release should I try and how do I get it?

Maybe clang-6.0 as the latest release (though I suspect you may run into
the recently-fixed-in-clang-7.0 "S" constraint bug that you reported).

I've had luck on debian based distributions installing from:
http://apt.llvm.org/

(These can be added to your /etc/apt/sources.list, then a `sudo apt update`
and `sudo apt install clang-6.0`)

If you're not able to add remote repositories (some employers block this ;)
), then you can find releases for download for a few different platforms:
https://releases.llvm.org/

For example, a quick:
$ mkdir llvm-6.0
$ cd !$
$ wget
https://releases.llvm.org/6.0.0/clang+llvm-6.0.0-x86_64-linux-gnu-debian8.tar.xz
$ tar xvf clang+llvm-6.0.0-x86_64-linux-gnu-debian8.tar.xz
$ ./clang+llvm-6.0.0-x86_64-linux-gnu-debian8/bin/clang-6.0 -v
clang version 6.0.0 (tags/RELEASE_600/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: .../llvm-6.0/./clang+llvm-6.0.0-x86_64-linux-gnu-debian8/bin
Found candidate GCC installation: ...
Candidate multilib: .;@m64
Selected multilib: .;@m64

Seems to work.
-- 
Thanks,
~Nick Desaulniers


Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-23 Thread Nadav Amit
Dave Hansen  wrote:

> On 05/22/2018 10:51 AM, Matthew Wilcox wrote:
>> But CR3 is a per-CPU register.  So it'd be *possible* to allocate one
>> PGD per CPU (per process).  Have them be identical in all but one of
>> the PUD entries.  Then you've reserved 1/512 of your address space for
>> per-CPU pages.
>> 
>> Complicated, ugly, memory-consuming.  But possible.
> 
> Yep, and you'd probably want a cache of them so you don't end up having
> to go rewrite half of the PGD every time you context-switch.  But, on
> the plus side, the logic would be pretty similar if not identical to the
> way that we manage PCIDs.  If your mm was recently active on the CPU,
> you can use a PGD that's already been constructed.  If not, you're stuck
> making a new one.
> 
> Andy L. was also talking about using this kind of mechanism to simplify
> the entry code.  Instead of needing per-cpu areas where we index by the
> CPU number, or by using %GS, we could have per-cpu data or code that has
> a fixed virtual address.
> 
> It'd be a fun project, but it might not ever pan out.

For the record: there are several academic studies about this subject. The
most notable one is Corey [1].

[1] 
https://www.usenix.org/legacy/event/osdi08/tech/full_papers/boyd-wickizer/boyd_wickizer.pdf

RE: [PATCH v11 1/2] cpufreq: Add Kryo CPU scaling driver

2018-05-23 Thread ilialin


> -Original Message-
> From: Sudeep Holla 
> Sent: Wednesday, May 23, 2018 16:25
> To: Ilia Lin ; vire...@kernel.org; n...@ti.com;
> sb...@kernel.org; r...@kernel.org; mark.rutl...@arm.com;
> r...@rjwysocki.net; linux...@vger.kernel.org; devicet...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: Sudeep Holla 
> Subject: Re: [PATCH v11 1/2] cpufreq: Add Kryo CPU scaling driver
> 
> 
> 
> On 23/05/18 13:38, Ilia Lin wrote:
> > In certain QCOM SoCs like apq8096 and msm8996 that have KRYO
> > processors, the CPU frequency subset and voltage value of each OPP
> > varies based on the silicon variant in use. Qualcomm Process Voltage
> > Scaling Tables defines the voltage and frequency value based on the
> > msm-id in SMEM and speedbin blown in the efuse combination.
> > The qcom-cpufreq-kryo driver reads the msm-id and efuse value from the
> > SoC to provide the OPP framework with required information.
> > This is used to determine the voltage and frequency value for each OPP
> > of
> > operating-points-v2 table when it is parsed by the OPP framework.
> >
> > Signed-off-by: Ilia Lin 
> > ---
> >  drivers/cpufreq/Kconfig.arm  |  10 ++
> >  drivers/cpufreq/Makefile |   1 +
> >  drivers/cpufreq/cpufreq-dt-platdev.c |   3 +
> >  drivers/cpufreq/qcom-cpufreq-kryo.c  | 181
> > +++
> >  4 files changed, 195 insertions(+)
> >  create mode 100644 drivers/cpufreq/qcom-cpufreq-kryo.c
> >
> > diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
> > index de55c7d..0bfd40e 100644
> > --- a/drivers/cpufreq/Kconfig.arm
> > +++ b/drivers/cpufreq/Kconfig.arm
> > @@ -124,6 +124,16 @@ config ARM_OMAP2PLUS_CPUFREQ
> > depends on ARCH_OMAP2PLUS
> > default ARCH_OMAP2PLUS
> >
> > +config ARM_QCOM_CPUFREQ_KRYO
> > +   bool "Qualcomm Kryo based CPUFreq"
> > +   depends on QCOM_QFPROM
> > +   depends on QCOM_SMEM
> > +   select PM_OPP
> > +   help
> > + This adds the CPUFreq driver for Qualcomm Kryo SoC based boards.
> > +
> > + If in doubt, say N.
> > +
> 
> Sorry but just noticed now, any reason why this can't be module. I can't
> imagine any.

I was asked previously to change this from tristate to bool.

> 
> [..]
> 
> > +static int qcom_cpufreq_kryo_probe(struct platform_device *pdev) {
> > +   struct opp_table *opp_tables[NR_CPUS] = {0};
> > +   struct platform_device *cpufreq_dt_pdev;
> > +   enum _msm8996_version msm8996_version;
> > +   struct nvmem_cell *speedbin_nvmem;
> > +   struct device_node *np;
> > +   struct device *cpu_dev;
> 
> [..]
> 
> > +
> > +   cpufreq_dt_pdev = platform_device_register_simple("cpufreq-dt", -1, NULL, 0);
> > +   if (!IS_ERR(cpufreq_dt_pdev))
> > +   return 0;
> > +
> > +   ret = PTR_ERR(cpufreq_dt_pdev);
> > +   dev_err(cpu_dev, "Failed to register platform device\n");
> > +
> > +free_opp:
> > +   for_each_possible_cpu(cpu) {
> > +   if (IS_ERR_OR_NULL(opp_tables[cpu]))
> > +   break;
> > +   dev_pm_opp_put_supported_hw(opp_tables[cpu]);
> > +   }
> > +
> > +   return ret;
> > +}
> > +
> > +static int __init qcom_cpufreq_kryo_init(void) {
> > +   /*
> > +* Since the driver depends on smem and nvmem drivers, which may
> > +* return EPROBE_DEFER, all the real activity is done in the probe,
> > +* which may be deferred as well. The init here is only registering
> > +* a platform device.
> > +*/
> > +   platform_device_register_simple("qcom-cpufreq-kryo", -1, NULL, 0);
> > +   return 0;
> > +}
> > +module_init(qcom_cpufreq_kryo_init);
> 
> Do you need this at all ? See below on how to eliminate the need for this.
> 
> > +
> > +static struct platform_driver qcom_cpufreq_kryo_driver = {
> > +   .probe = qcom_cpufreq_kryo_probe,
> > +   .driver = {
> > +   .name = "qcom-cpufreq-kryo",
> > +   },
> > +};
> > +builtin_platform_driver(qcom_cpufreq_kryo_driver);
> 
> Use builtin_platform_driver_probe and remove qcom_cpufreq_kryo_init or
> use module_platform_driver_probe if it can be module.
> 
> --
> Regards,
> Sudeep
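Sudeep's builtin_platform_driver_probe() suggestion can be sketched as below. This is an illustrative fragment, not the actual driver: the probe routine is reused unchanged, and note that something (firmware description or an initcall) must still create the matching "qcom-cpufreq-kryo" platform device for the probe to fire.

```c
/*
 * Sketch only: let the platform core do the one-shot probe registration
 * instead of open-coding an initcall.  With this, qcom_cpufreq_kryo_init()
 * and the separate builtin_platform_driver() line go away, and the probe
 * function may even live in __init memory.  If the driver can be built
 * as a module, module_platform_driver_probe() would be used instead.
 */
static struct platform_driver qcom_cpufreq_kryo_driver = {
	.driver = {
		.name = "qcom-cpufreq-kryo",
	},
};

builtin_platform_driver_probe(qcom_cpufreq_kryo_driver,
			      qcom_cpufreq_kryo_probe);
```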



semantics of rhashtable and sysvipc

2018-05-23 Thread Davidlohr Bueso

Hi,

In sysvipc we have an ids->tables_initialized regarding the rhashtable,
introduced in 0cfb6aee70b (ipc: optimize semget/shmget/msgget for lots of keys).
It's there, specifically, to prevent nil pointer dereferences from using an
uninitialized API. Considering how rhashtable_init() can fail (probably due to
ENOMEM, if anything), this made the overall ipc initialization capable of
failure as well. That alone is ugly, but fine, however I've spotted a few
issues regarding the semantics of tables_initialized (however unlikely they
may be):

- There is inconsistency in what we return to userspace: ipc_addid() returns
ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr() returns
EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

I see two possible fixes. The first is to return the proper error code
if !tables_initialized; however, I'm not sure how we want to deal with the
EINVAL cases when rhashtable_init() fails. Userspace has no reason to know
about this. The ENOMEM case, I guess, makes sense.

The second alternative would be to add a BUG_ON() if the initialization fails
and get rid of the whole tables_initialized hack. I know Linus isn't fond
of BUG_ON, and in the past ipc has gotten rid of BUG_ON usage for exactly
that reason. However, I mention it because other core areas do this (or
call panic() straight away), e.g. in networking: sock_diag_init() and
netlink_proto_init().

Thoughts?

Thanks,
Davidlohr



Re: [PATCH 07/11] dts: juno: Add scatter-gather support for all revisions

2018-05-23 Thread Mathieu Poirier
On 18 May 2018 at 10:39, Suzuki K Poulose  wrote:
> Advertise that the scatter-gather is properly integrated on
> all revisions of Juno board.
>
> Cc: Mathieu Poirier 
> Cc: Sudeep Holla 
> Cc: Liviu Dudau 
> Cc: Lorenzo Pieralisi 
> Signed-off-by: Suzuki K Poulose 

Reviewed-by: Mathieu Poirier 

> ---
>  arch/arm64/boot/dts/arm/juno-base.dtsi | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/boot/dts/arm/juno-base.dtsi 
> b/arch/arm64/boot/dts/arm/juno-base.dtsi
> index eb749c5..6ce9090 100644
> --- a/arch/arm64/boot/dts/arm/juno-base.dtsi
> +++ b/arch/arm64/boot/dts/arm/juno-base.dtsi
> @@ -198,6 +198,7 @@
> > clocks = <&soc_smc50mhz>;
> clock-names = "apb_pclk";
> > power-domains = <&scpi_devpd 0>;
> +   arm,scatter-gather;
> port {
> etr_in_port: endpoint {
> slave-mode;
> --
> 2.7.4
>



Re: [PATCH net V2 0/4] Fix several issues of virtio-net mergeable XDP

2018-05-23 Thread David Miller
From: Jason Wang 
Date: Tue, 22 May 2018 11:44:27 +0800

> Please review the patches that try to fix several issues of
> virtio-net mergeable XDP.
> 
> Changes from V1:
> - check against 1 before decreasing instead of resetting to 1
> - typo fixes

Series applied and queued up for -stable.



Re: B53 DSA switch problem on Banana Pi-R1 on Fedora 26

2018-05-23 Thread Gerhard Wiesinger

On 23.05.2018 17:28, Florian Fainelli wrote:



And in the future (time plan)?

If you don't care about multicast then you can use those patches:

https://github.com/ffainelli/linux/commit/de055bf5f34e9806463ab2793e0852f5dfc380df

and you have to change the part of drivers/net/dsa/b53/b53_common.c that
returns DSA_TAG_PROTO_NONE for 53125:


diff --git a/drivers/net/dsa/b53/b53_common.c
b/drivers/net/dsa/b53/b53_common.c
index 9f561fe505cb..3c64f026a8ce 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1557,7 +1557,7 @@ enum dsa_tag_protocol b53_get_tag_protocol(struct
dsa_switch *ds, int port)
  * mode to be turned on which means we need to specifically
manage ARL
  * misses on multicast addresses (TBD).
  */
-   if (is5325(dev) || is5365(dev) || is539x(dev) || is531x5(dev) ||
+   if (is5325(dev) || is5365(dev) || is539x(dev) ||
 !b53_can_enable_brcm_tags(ds, port))
 return DSA_TAG_PROTO_NONE;


That would bring Broadcom tags to the 53125 switch and you would be able
to use the configuration lines from Andrew in that case.


What's the plan here regarding these two configuration modes (what do you
call them?)?


I mean, will this be a breaking change in the future, where the
configuration has to be done in a different way?


Or will it be configurable via module parameters or /proc or /sys 
filesystem options?



Thank you.

Ciao,

Gerhard



Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-23 Thread Patrick Bellasi
Hi Waiman,

On 17-May 16:55, Waiman Long wrote:

[...]

> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t 
> **domains,
>   int ndoms = 0;  /* number of sched domains in result */
>   int nslot;  /* next empty doms[] struct cpumask slot */
>   struct cgroup_subsys_state *pos_css;
> + bool root_load_balance = is_sched_load_balance(&top_cpuset);
>  
>   doms = NULL;
>   dattr = NULL;
>   csa = NULL;
>  
>   /* Special case for the 99% of systems with one, full, sched domain */
> - if (is_sched_load_balance(&top_cpuset)) {
> + if (root_load_balance && !top_cpuset.isolation_count) {

Perhaps I'm missing something but, it seems to me that, when the two
conditions above are true, then we are going to destroy and rebuild
the exact same scheduling domains.

IOW, on 99% of systems where:

   is_sched_load_balance(&top_cpuset)
   top_cpuset.isolation_count == 0

since boot time and forever, then every time we update a value for
cpuset.cpus we keep rebuilding the same SDs.

It's not strictly related to this patch, the same already happens in
mainline based just on the first condition, but since you are extending
that optimization, perhaps you can tell me where I'm possibly wrong or
which cases I'm not considering.

I'm interested mainly because on Android systems those conditions
are always true and we see SDs rebuilds every time we write
something in cpuset.cpus, which ultimately accounts for almost all the
6-7[ms] time required for the write to return, depending on the CPU
frequency.

Cheers Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v3 net-next 0/2] bpfilter

2018-05-23 Thread Greg KH
On Wed, May 23, 2018 at 01:26:48PM -0400, David Miller wrote:
> From: Alexei Starovoitov 
> Date: Mon, 21 May 2018 19:22:28 -0700
> 
> > v2->v3:
> > - followed Luis's suggestion and significantly simplied first patch
> >   with shmem_kernel_file_setup+kernel_write. Added kdoc for new helper
> > - fixed typos and race to access pipes with mutex
> > - tested with bpfilter being 'builtin'. CONFIG_BPFILTER_UMH=y|m both work.
> >   Interesting to see a usermode executable being embedded inside vmlinux.
> > - it doesn't hurt to enable bpfilter in .config.
> >   ip_setsockopt commands sent to usermode via pipes and -ENOPROTOOPT is
> >   returned from userspace, so kernel falls back to original iptables code
> > 
> > v1->v2:
> > this patch set is almost a full rewrite of the earlier umh modules approach
> > The v1 of patches and follow up discussion was covered by LWN:
> > https://lwn.net/Articles/749108/
> > 
> > I believe the v2 addresses all issues brought up by Andy and others.
> > Mainly there are zero changes to kernel/module.c
> > Instead of teaching module loading logic to recognize special
> > umh module, let normal kernel modules execute part of its own
> > .init.rodata as a new user space process (Andy's idea)
> > Patch 1 introduces this new helper:
> > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > Input:
> >   data + len == executable file
> > Output:
> >   struct umh_info {
> >struct file *pipe_to_umh;
> >struct file *pipe_from_umh;
> >pid_t pid;
> >   };
> 
> Series applied, let the madness begin... :-)

Yeah, this is going to be fun :)


Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-23 Thread Dave Hansen
On 05/22/2018 10:51 AM, Matthew Wilcox wrote:
> But CR3 is a per-CPU register.  So it'd be *possible* to allocate one
> PGD per CPU (per process).  Have them be identical in all but one of
> the PUD entries.  Then you've reserved 1/512 of your address space for
> per-CPU pages.
> 
> Complicated, ugly, memory-consuming.  But possible.

Yep, and you'd probably want a cache of them so you don't end up having
to go rewrite half of the PGD every time you context-switch.  But, on
the plus side, the logic would be pretty similar if not identical to the
way that we manage PCIDs.  If your mm was recently active on the CPU,
you can use a PGD that's already been constructed.  If not, you're stuck
making a new one.

Andy L. was also talking about using this kind of mechanism to simplify
the entry code.  Instead of needing per-cpu areas where we index by the
CPU number, or by using %GS, we could have per-cpu data or code that has
a fixed virtual address.

It'd be a fun project, but it might not ever pan out.


Re: B53 DSA switch problem on Banana Pi-R1 on Fedora 26

2018-05-23 Thread Gerhard Wiesinger

On 23.05.2018 17:50, Florian Fainelli wrote:


On 05/23/2018 08:28 AM, Florian Fainelli wrote:


On 05/22/2018 09:49 PM, Gerhard Wiesinger wrote:

On 22.05.2018 22:42, Florian Fainelli wrote:

On 05/22/2018 01:16 PM, Andrew Lunn wrote:

Planned network structure will be as with 4.7.x kernels:

br0 <=> eth0.101 <=> eth0 (vlan 101 tagged) <=> lan 1-lan4 (vlan 101
untagged pvid)

br1 <=> eth0.102 <=> eth0 (vlan 102 tagged) <=> wan (vlan 102
untagged pvid)

Do you even need these vlans?

Yes, remember, b53 does not currently turn on Broadcom tags, so the only
way to segregate traffic is to have VLANs for that.


Are you doing this for port separation? To keep lan1-4 traffic
separate from wan? DSA does that by default, no vlan needed.

So you can just do

ip link add name br0 type bridge
ip link set dev br0 up
ip link set dev lan1 master br0
ip link set dev lan2 master br0
ip link set dev lan3 master br0
ip link set dev lan4 master br0

and use interface wan directly, no bridge needed.

That would work once Broadcom tags are turned on which requires turning
on managed mode, which requires work that I have not been able to get
done :)

Setup with swconfig:

#!/usr/bin/bash


INTERFACE=eth0

# Delete all IP addresses and get link up
ip addr flush dev ${INTERFACE}
ip link set ${INTERFACE} up

# Lamobo R1 aka BPi R1 Routerboard
#
# Speaker | LAN1 | LAN2 | LAN3 | LAN4 || LAN5 | HDMI
# SW-Port |  P2  |  P1  |  P0  |  P4  ||  P3  |
# VLAN    |  11  |  12  |  13  |  14  ||ALL(t)|
#
# Switch-Port P8 - ALL(t) boards internal CPU Port

# Setup switch
swconfig dev ${INTERFACE} set reset 1
swconfig dev ${INTERFACE} set enable_vlan 1
swconfig dev ${INTERFACE} vlan 101 set ports '3 8t'
swconfig dev ${INTERFACE} vlan 102 set ports '4 0 1 2 8t'
swconfig dev ${INTERFACE} set apply 1

How to achieve this setup CURRENTLY with DSA?

Your first email had the right programming sequence, but you did not
answer whether you have CONFIG_BRIDGE_VLAN_FILTERING enabled or not,
which is likely your problem.

Here are some reference configurations that should work:

https://github.com/armbian/build/issues/511#issuecomment-320473246
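
Those reference configurations boil down to something like the following bridge-vlan sketch. The port names lan1-lan4/wan are assumptions (they depend on the device tree labels), and it requires CONFIG_BRIDGE_VLAN_FILTERING=y:

```shell
# Hypothetical DSA equivalent of the swconfig script above:
# lan1-lan4 untagged on VLAN 101, wan untagged on VLAN 102,
# both VLANs tagged towards the CPU via the bridge itself.
ip link add name br0 type bridge vlan_filtering 1
ip link set dev br0 up

for port in lan1 lan2 lan3 lan4; do
	ip link set dev "$port" up
	ip link set dev "$port" master br0
	bridge vlan add dev "$port" vid 101 pvid untagged
done

ip link set dev wan up
ip link set dev wan master br0
bridge vlan add dev wan vid 102 pvid untagged

# Make the VLANs visible on the bridge device (CPU side):
bridge vlan add dev br0 vid 101 self
bridge vlan add dev br0 vid 102 self
```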


I know, some of those comments are from me, but none of them worked, which 
is why I'm asking on LKML :-)


/boot/config-4.16.7-100.fc26.armv7hl:CONFIG_BRIDGE_VLAN_FILTERING=y

so this can't be the issue, any further ideas?

On my 2nd Banana Pi-R1 still on Fedora 25 with kernel 
4.12.8-200.fc25.armv7hl the commands still work well, but I wanted to 
test the upgrade on another one.


/boot/config-4.12.8-200.fc25.armv7hl:CONFIG_BRIDGE_VLAN_FILTERING=y

Thnx.

Ciao,

Gerhard



Re: [PATCH] sched,tracing: Correct trace_sched_pi_setprio() for deboosting

2018-05-23 Thread Peter Zijlstra
On Wed, May 23, 2018 at 04:11:07PM +0200, Sebastian Andrzej Siewior wrote:

> Since that commit I see during a deboost a task this:
> |futex sched_pi_setprio: comm=futex_requeue_p pid=2234 oldprio=98 newprio=98
> |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2234 prev_prio=120
> 
> and after the revert, the `newprio' shows the correct value again:
> 
> |futex sched_pi_setprio: comm=futex_requeue_p pid=2220 oldprio=98 newprio=120
> |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2220 prev_prio=120

> @@ -435,7 +435,7 @@ TRACE_EVENT(sched_pi_setprio,
>   memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>   __entry->pid= tsk->pid;
>   __entry->oldprio= tsk->prio;
> - __entry->newprio= pi_task ? pi_task->prio : tsk->prio;
> + __entry->newprio= new_prio;
>   /* XXX SCHED_DEADLINE bits missing */
>   ),
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 092f7c4de903..888df643b99b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3823,7 +3823,7 @@ void rt_mutex_setprio(struct task_struct *p, struct 
> task_struct *pi_task)
>   goto out_unlock;
>   }
>  
> - trace_sched_pi_setprio(p, pi_task);
> + trace_sched_pi_setprio(p, prio);

at this point:

prio = pi_task ? min(p->normal_prio, pi_task->prio) : p->normal_prio;

(aka __rt_effective_prio)

Should we put that in the tracepoint instead?

>   oldprio = p->prio;
>  
>   if (oldprio == prio)


Re: [PATCH V4 34/38] x86/intel_rdt: Create debugfs files for pseudo-locking testing

2018-05-23 Thread Greg KH
On Wed, May 23, 2018 at 10:19:41AM -0700, Reinette Chatre wrote:
> Hi Greg,
> 
> On 5/23/2018 1:05 AM, Greg KH wrote:
> > On Tue, May 22, 2018 at 02:02:37PM -0700, Reinette Chatre wrote:
> >> On 5/22/2018 12:43 PM, Greg KH wrote:
> >>> On Tue, May 22, 2018 at 04:29:22AM -0700, Reinette Chatre wrote:
>  +ret = strtobool(buf, &bv);
>  +if (ret == 0 && bv) {
>  +ret = debugfs_file_get(file->f_path.dentry);
>  +if (unlikely(ret))
>  +return ret;
> >>>
> >>> Only ever use unlikely/likely if you can measure the performance
> >>> difference.  Hint, you can't do that here, it's not needed at all.
> >>
> >> Here my intention was to follow the current best practices and in the
> >> kernel source I am working with eight of the ten usages of
> >> debugfs_file_get() are followed by an unlikely(). My assumption was thus
> >> that this is a best practice. Thanks for catching this - I'll change it.
> > 
> > Really?  That's some horrible examples, any pointers to them?  I think I
> > need to do a massive sweep of the kernel tree and fix up all of this
> > crud so that people don't keep cut/paste the same bad code everywhere.
> 
> As you know debugfs_file_get() is a recent addition to the kernel,
> introduced in:
> commit e9117a5a4bf65d8e99f060d356a04d27a60b436d
> Author: Nicolai Stange 
> Date:   Tue Oct 31 00:15:48 2017 +0100
> 
> debugfs: implement per-file removal protection
> 
> Following this introduction, the same author modified the now obsolete
> calls of debugfs_use_file_start() to debugfs_file_get() in commits:
> 
> commit 7cda7b8f97da9382bb945d541a85cde58d5dac27
> Author: Nicolai Stange 
> Date:   Tue Oct 31 00:15:51 2017 +0100
> 
> IB/hfi1: convert to debugfs_file_get() and -put()
> 
> 
> commit 69d29f9e6a53559895e6f785f6cf72daa738f132
> Author: Nicolai Stange 
> Date:   Tue Oct 31 00:15:50 2017 +0100
> 
> debugfs: convert to debugfs_file_get() and -put()
> 
> 
> In the above two commits the usage of the new debugfs_file_get()
> primarily follows the pattern of:
> r = debugfs_file_get(d);
> if (unlikely(r))
> 
> Since the author of the new interface used the pattern above in the
> conversions I do not think it is unreasonable to find other developers
> following suit believing that it is a best practice.

Ah, that's where that pattern came from, thanks for finding it.  It was
a conversion of the "old" api in the IB code that was using likely(),
which in a way, did make sense to use (due to the way processors assume
0 is "true")

I'll work on cleaning all of these up on my next long plane ride, should
give me something to do :)

thanks,

greg k-h


Re: [PATCH v3 net-next 0/2] bpfilter

2018-05-23 Thread David Miller
From: Alexei Starovoitov 
Date: Mon, 21 May 2018 19:22:28 -0700

> v2->v3:
> - followed Luis's suggestion and significantly simplied first patch
>   with shmem_kernel_file_setup+kernel_write. Added kdoc for new helper
> - fixed typos and race to access pipes with mutex
> - tested with bpfilter being 'builtin'. CONFIG_BPFILTER_UMH=y|m both work.
>   Interesting to see a usermode executable being embedded inside vmlinux.
> - it doesn't hurt to enable bpfilter in .config.
>   ip_setsockopt commands sent to usermode via pipes and -ENOPROTOOPT is
>   returned from userspace, so kernel falls back to original iptables code
> 
> v1->v2:
> this patch set is almost a full rewrite of the earlier umh modules approach
> The v1 of patches and follow up discussion was covered by LWN:
> https://lwn.net/Articles/749108/
> 
> I believe the v2 addresses all issues brought up by Andy and others.
> Mainly there are zero changes to kernel/module.c
> Instead of teaching module loading logic to recognize special
> umh module, let normal kernel modules execute part of its own
> .init.rodata as a new user space process (Andy's idea)
> Patch 1 introduces this new helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> Input:
>   data + len == executable file
> Output:
>   struct umh_info {
>struct file *pipe_to_umh;
>struct file *pipe_from_umh;
>pid_t pid;
>   };

Series applied, let the madness begin... :-)



Re: [PATCH 6/6] Drop flex_arrays

2018-05-23 Thread Dave Hansen
Thanks for the heads-up, Matthew!

On 05/22/2018 06:18 PM, Kent Overstreet wrote:
> All existing users have been converted to generic radix trees

Well, flex_arrays had a good run, and the new radix trees do look quite
a bit nicer.

Feel free to add my ack.


Re: [PATCH V4 34/38] x86/intel_rdt: Create debugfs files for pseudo-locking testing

2018-05-23 Thread Reinette Chatre
Hi Greg,

On 5/23/2018 1:05 AM, Greg KH wrote:
> On Tue, May 22, 2018 at 02:02:37PM -0700, Reinette Chatre wrote:
>> On 5/22/2018 12:43 PM, Greg KH wrote:
>>> On Tue, May 22, 2018 at 04:29:22AM -0700, Reinette Chatre wrote:
 +  ret = strtobool(buf, &bv);
 +  if (ret == 0 && bv) {
 +  ret = debugfs_file_get(file->f_path.dentry);
 +  if (unlikely(ret))
 +  return ret;
>>>
>>> Only ever use unlikely/likely if you can measure the performance
>>> difference.  Hint, you can't do that here, it's not needed at all.
>>
>> Here my intention was to follow the current best practices and in the
>> kernel source I am working with eight of the ten usages of
>> debugfs_file_get() are followed by an unlikely(). My assumption was thus
>> that this is a best practice. Thanks for catching this - I'll change it.
> 
> Really?  That's some horrible examples, any pointers to them?  I think I
> need to do a massive sweep of the kernel tree and fix up all of this
> crud so that people don't keep cut/paste the same bad code everywhere.

As you know debugfs_file_get() is a recent addition to the kernel,
introduced in:
commit e9117a5a4bf65d8e99f060d356a04d27a60b436d
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:48 2017 +0100

debugfs: implement per-file removal protection

Following this introduction, the same author modified the now obsolete
calls of debugfs_use_file_start() to debugfs_file_get() in commits:

commit 7cda7b8f97da9382bb945d541a85cde58d5dac27
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:51 2017 +0100

IB/hfi1: convert to debugfs_file_get() and -put()


commit 69d29f9e6a53559895e6f785f6cf72daa738f132
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:50 2017 +0100

debugfs: convert to debugfs_file_get() and -put()


In the above two commits the usage of the new debugfs_file_get()
primarily follows the pattern of:
r = debugfs_file_get(d);
if (unlikely(r))

Since the author of the new interface used the pattern above in the
conversions I do not think it is unreasonable to find other developers
following suit believing that it is a best practice.

This pattern remains in the majority when looking at the output of (on
v4.17-rc5):
git grep -A 1 ' = debugfs_file_get'

 +#ifdef CONFIG_INTEL_RDT_DEBUGFS
 +  plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name,
 +debugfs_resctrl);
 +  if (IS_ERR(plr->debugfs_dir)) {
 +  ret = PTR_ERR(plr->debugfs_dir);
 +  plr->debugfs_dir = NULL;
 +  goto out_region;
>>>
>>> Ick no, you never need to care about the return value of a debugfs call.
>>> You code should never do something different if a debugfs call succeeds
>>> or fails.  And you are checking it wrong, even if you did want to do
>>> this :)
>>
>> Ah - I see I need to be using IS_ERR_OR_NULL() instead of IS_ERR()? If
>> this is the case then please note that there seems to be quite a few
>> debugfs_create_dir() calls within the kernel that have the same issue.
> 
> Again, they are all wrong :)
> 
> Just ignore the return value, unless it is a directory, and then just
> save it like you are here.  Don't check the value, you can always pass
> it into a future debugfs call with no problems.

Will do. Thank you very much for the advice.

 +  }
 +
 +  entry = debugfs_create_file("pseudo_lock_measure", 0200,
 +  plr->debugfs_dir, rdtgrp,
 +  _measure_fops);
 +  if (IS_ERR(entry)) {
 +  ret = PTR_ERR(entry);
 +  goto out_debugfs;
 +  }
>>>
>>> Again, you don't care, don't do this.
>>>
 +#ifdef CONFIG_INTEL_RDT_DEBUGFS
 +  debugfs_remove_recursive(rdtgrp->plr->debugfs_dir);
 +#endif
>>>
>>> Don't put ifdefs in .c files, it's not the Linux way at all.  You can
>>> make this a lot simpler/easier to maintain over time if you do not.
>>
>> My mistake - I assumed this would be ok based on my interpretation of
>> how CONFIG_GENERIC_IRQ_DEBUGFS is used.
>>
>> I could rework the debugfs code to be contained in a new debugfs
>> specific .c file that is only compiled if the configuration is set. The
>> ifdefs will then be restricted to a .h file that contains the
>> declarations of these debugfs functions with empty variants when the
>> user did not select the debugfs config option.
>>
>> Would that be acceptable to you?
> 
> Yes, that is the correct way to do this.
> 
> But why would someone _not_ want this option?  Why not always just
> include the functionality, that way you don't have to ask someone to
> rebuild a kernel if you need that debug information.  And distros will
> always enable the option anyway, so it's not like you are keeping things
> "smaller", if you disable debugfs, all of that code should just compile
> away to almost nothing anyway.

Will do.

Thank you very much for taking the time to 

Re: [PATCH V4 34/38] x86/intel_rdt: Create debugfs files for pseudo-locking testing

2018-05-23 Thread Reinette Chatre
Hi Greg,

On 5/23/2018 1:05 AM, Greg KH wrote:
> On Tue, May 22, 2018 at 02:02:37PM -0700, Reinette Chatre wrote:
>> On 5/22/2018 12:43 PM, Greg KH wrote:
>>> On Tue, May 22, 2018 at 04:29:22AM -0700, Reinette Chatre wrote:
 +  ret = strtobool(buf, );
 +  if (ret == 0 && bv) {
 +  ret = debugfs_file_get(file->f_path.dentry);
 +  if (unlikely(ret))
 +  return ret;
>>>
>>> Only ever use unlikely/likely if you can measure the performance
>>> difference.  Hint, you can't do that here, it's not needed at all.
>>
>> Here my intention was to follow the current best practices and in the
>> kernel source I am working with eight of the ten usages of
>> debugfs_file_get() is followed by an unlikely(). My assumption was thus
>> that this is a best practice. Thanks for catching this - I'll change it.
> 
> Really?  That's some horrible examples, any pointers to them?  I think I
> need to do a massive sweep of the kernel tree and fix up all of this
> crud so that people don't keep cut/paste the same bad code everywhere.

As you know debugfs_file_get() is a recent addition to the kernel,
introduced in:
commit e9117a5a4bf65d8e99f060d356a04d27a60b436d
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:48 2017 +0100

debugfs: implement per-file removal protection

Following this introduction, the same author modified the now obsolete
calls of debugfs_use_file_start() to debugfs_file_get() in commits:

commit 7cda7b8f97da9382bb945d541a85cde58d5dac27
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:51 2017 +0100

IB/hfi1: convert to debugfs_file_get() and -put()


commit 69d29f9e6a53559895e6f785f6cf72daa738f132
Author: Nicolai Stange 
Date:   Tue Oct 31 00:15:50 2017 +0100

debugfs: convert to debugfs_file_get() and -put()


In the above two commits the usage of the new debugfs_file_get()
primarily follows the pattern of:
r = debugfs_file_get(d);
if (unlikely(r))

Since the author of the new interface used the pattern above in the
conversions I do not think it is unreasonable to find other developers
following suit believing that it is a best practice.

This pattern remains in the majority when looking at the output of (on
v4.17-rc5):
git grep -A 1 ' = debugfs_file_get'

 +#ifdef CONFIG_INTEL_RDT_DEBUGFS
 +  plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name,
 +debugfs_resctrl);
 +  if (IS_ERR(plr->debugfs_dir)) {
 +  ret = PTR_ERR(plr->debugfs_dir);
 +  plr->debugfs_dir = NULL;
 +  goto out_region;
>>>
>>> Ick no, you never need to care about the return value of a debugfs call.
>>> You code should never do something different if a debugfs call succeeds
>>> or fails.  And you are checking it wrong, even if you did want to do
>>> this :)
>>
>> Ah - I see I need to be using IS_ERR_OR_NULL() instead of IS_ERR()? If
>> this is the case then please note that there seems to be quite a few
>> debugfs_create_dir() calls within the kernel that have the same issue.
> 
> Again, they are all wrong :)
> 
> Just ignore the return value, unless it is a directory, and then just
> save it like you are here.  Don't check the value, you can always pass
> it into a future debugfs call with no problems.

Will do. Thank you very much for the advice.

 +  }
 +
 +  entry = debugfs_create_file("pseudo_lock_measure", 0200,
 +  plr->debugfs_dir, rdtgrp,
 +  _measure_fops);
 +  if (IS_ERR(entry)) {
 +  ret = PTR_ERR(entry);
 +  goto out_debugfs;
 +  }
>>>
>>> Again, you don't care, don't do this.
>>>
 +#ifdef CONFIG_INTEL_RDT_DEBUGFS
 +  debugfs_remove_recursive(rdtgrp->plr->debugfs_dir);
 +#endif
>>>
>>> Don't put ifdefs in .c files, it's not the Linux way at all.  You can
>>> make this a lot simpler/easier to maintain over time if you do not.
>>
>> My mistake - I assumed this would be ok based on my interpretation of
>> how CONFIG_GENERIC_IRQ_DEBUGFS is used.
>>
>> I could rework the debugfs code to be contained in a new debugfs
>> specific .c file that is only compiled if the configuration is set. The
>> ifdefs will then be restricted to a .h file that contains the
>> declarations of these debugfs functions with empty variants when the
>> user did not select the debugfs config option.
>>
>> Would that be acceptable to you?
> 
> Yes, that is the correct way to do this.
> 
> But why would someone _not_ want this option?  Why not always just
> include the functionality, that way you don't have to ask someone to
> rebuild a kernel if you need that debug information.  And distros will
> always enable the option anyway, so it's not like you are keeping things
> "smaller", if you disable debugfs, all of that code should just compile
> away to almost nothing anyway.

Will do.

Thank you very much for taking the time to review and provide
constructive feedback.

Reinette



Re: [PATCH 00/13] atomics: API cleanups

2018-05-23 Thread Peter Zijlstra
On Wed, May 23, 2018 at 02:35:20PM +0100, Mark Rutland wrote:
> This series contains a few cleanups of the atomic API, fixing an
> inconsistency between atomic_* and atomic64_*, and minimizing repetition
> in arch code. This is nicer for arch code, and the improved regularity
> will help when generating the atomic headers in future.
> 
> The bulk of the patches reorganise things so architectures consistently
> provide _fetch_add_unless(), with atomic_fetch_add_unless()
> provided as a wrapper by core code. A generic fallback is provided for
> _fetch_add_unless(), based on _read() and
> _try_cmpxchg().
> 
> Other patches in the series add common fallbacks for:
> 
> * atomic64_inc_not_zero() 
> * _inc_and_test()
> * _dec_and_test()
> * _sub_and_test()
> * add_negative()
> 
> ... as almost all architectures provide identical implementation of
> these today.
> 
> The end result is a strongly negative diffstat, though 
> grows by a reasonable amount. When we generate the headers, we can halve
> this by templating the various fallbacks for atomic{,64}_t.
> 

Thanks for this Mark,

Acked-by: Peter Zijlstra (Intel) 

Ingo, can you magic this into tip somewhere?



Re: [PATCH 1/2] slimbus: ngd: dt-bindings: Add slim ngd dt bindings

2018-05-23 Thread Srinivas Kandagatla

Thanks Rob for review,

On 23/05/18 17:40, Rob Herring wrote:

On Wed, May 16, 2018 at 05:51:17PM +0100, Srinivas Kandagatla wrote:

This patch adds bindings for Qualcomm SLIMBus NGD controller found in
all new SoCs starting from B family.
SLIMBus NGD controller is a light-weight driver responsible for
communicating with SLIMBus slaves directly over the bus using messaging
interface and communicating with master component residing on ADSP for
bandwidth and data-channel management

Signed-off-by: Srinivas Kandagatla 
---
  .../bindings/slimbus/slim-ngd-qcom-ctrl.txt| 70 ++
  1 file changed, 70 insertions(+)
  create mode 100644 
Documentation/devicetree/bindings/slimbus/slim-ngd-qcom-ctrl.txt

diff --git a/Documentation/devicetree/bindings/slimbus/slim-ngd-qcom-ctrl.txt 
b/Documentation/devicetree/bindings/slimbus/slim-ngd-qcom-ctrl.txt
new file mode 100644
index ..c948fb098819
--- /dev/null
+++ b/Documentation/devicetree/bindings/slimbus/slim-ngd-qcom-ctrl.txt
@@ -0,0 +1,70 @@
+Qualcomm SLIMBus Non Generic Device (NGD) Controller binding
+
+SLIMBus NGD controller is a light-weight driver responsible for communicating
+with SLIMBus slaves directly over the bus using messaging interface and
+communicating with master component residing on ADSP for bandwidth and
+data-channel management
+
+Please refer to slimbus/bus.txt for details of the common SLIMBus bindings.
+
+- compatible:
+   Usage: required
+   Value type: 
+   Definition: must be "qcom,slim-ngd"


SoC specific compatible needed.


Yes, I will add that in v2.



+
+- reg:
+   Usage: required
+   Value type: 
+   Definition: must specify the base address and size of the controller
+   register blocks.


blocks? Is there more than one? If so, how many?

It's just one. I will fix the text to reflect this.




+
+- reg-names:
+   Usage: required
+   Value type: 
+   Definition: must be "ctrl"


reg-names is pointless when there is only 1.


+
+- qcom,ngd-id
+   Usage: required
+   Value type: 
+   Definition: ngd instance id in the controller


Why do you need this?
I have removed this totally in my next version, which I will be posting 
soon.


Thanks,
srini
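Once the SoC-specific compatible is added and the single register block is described, an example node for this binding might look like the following (the unit address and register size here are purely illustrative, not taken from any actual SoC):

```
slim-ngd@91c0000 {
	compatible = "qcom,slim-ngd";
	reg = <0x91c0000 0x2c000>;
	reg-names = "ctrl";
};
```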



Re: [RFC PATCH] mtd: spi-nor: add support to non-uniform SPI NOR flash memories

2018-05-23 Thread Marek Vasut
On 05/23/2018 02:52 PM, Tudor Ambarus wrote:
> Hi, Marek,

Hi,

> On 05/23/2018 12:56 PM, Marek Vasut wrote:
> [...]
 [...]

>>> +    while (len) {
>>> +    cmd = spi_nor_find_best_erase_cmd(map, region, addr, len);
>>> +    if (!cmd)
>>> +    return -EINVAL;
>> What would happen if you realize mid-way that you cannot erase some
>> sector , do you end up with partial erase ?
> Is this possible? In non-overlaid regions, the address is aligned with
> at least one of the erase commands, else -EINVAL. For overlaid regions
> alignment doesn't matter. But yes, if this is possible, in this case,
> this proposal will do a partial erase.
 Shouldn't we fail up front instead ?
>>> It will be great if we can do this without having performance penalties.
>>> Can we loosen the conditions for the last erase command? If one wants to
>>> erase 80k chunk starting from offset 0 and only 32k and 64k erase type
>>> are supported, can we erase 96k?
>> No. But can you maybe build a list of erase commands to be executed once
>> you validate that the erase can be performed for example ?
> 
> My second choice was an array which saves u8 opcode and u32 erasesize.
> There are flashes of 256MB, in the worst case scenario with 4k erase
> type, we will end up with 64K entries.

Some RLE encoding might help here ?

> How about enforcing the length to be multiple of mtd->erasesize, like we
> do in uniform_erase? With this, the problem disappears.

What is the erase size for the 4k-sector 256MiB flash ?

-- 
Best regards,
Marek Vasut



[RFC PATCH v3 5/9] drivers/block/zram/zram_drv: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the GFP
bitmasks; the bottom three bits of the GFP mask are reserved for storing
the encoded zone number.

__GFP_ZONE_MOVABLE contains encoded ZONE_MOVABLE and __GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORing __GFP_MOVABLE means gfp_zone
should return ZONE_MOVABLE. In order to keep that compatible with
GFP_ZONE_TABLE, replace (__GFP_HIGHMEM | __GFP_MOVABLE) with
__GFP_ZONE_MOVABLE.

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0f3fadd..1bb5ca8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1004,14 +1004,12 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec,
handle = zs_malloc(zram->mem_pool, comp_len,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
-   __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   __GFP_ZONE_MOVABLE);
if (!handle) {
zcomp_stream_put(zram->comp);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
-   GFP_NOIO | __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   GFP_NOIO | __GFP_ZONE_MOVABLE);
if (handle)
goto compress_again;
return -ENOMEM;
-- 
1.8.3.1




Re: [PATCH 3/5] watchdog: sp805: set WDOG_HW_RUNNING when appropriate

2018-05-23 Thread Scott Branden

Ray,


On 18-05-23 09:29 AM, Ray Jui wrote:

Hi Robin,

On 5/23/2018 4:48 AM, Robin Murphy wrote:

On 23/05/18 08:52, Scott Branden wrote:



On 18-05-22 04:24 PM, Ray Jui wrote:

Hi Guenter,

On 5/22/2018 1:54 PM, Guenter Roeck wrote:

On Tue, May 22, 2018 at 11:47:18AM -0700, Ray Jui wrote:

If the watchdog hardware is already enabled during the boot process,
when the Linux watchdog driver loads, it should reset the 
watchdog and

tell the watchdog framework. As a result, ping can be generated from
the watchdog framework, until the userspace watchdog daemon takes 
over

control

Signed-off-by: Ray Jui 
Reviewed-by: Vladimir Olovyannikov 


Reviewed-by: Scott Branden 
---
  drivers/watchdog/sp805_wdt.c | 22 ++
  1 file changed, 22 insertions(+)

diff --git a/drivers/watchdog/sp805_wdt.c 
b/drivers/watchdog/sp805_wdt.c

index 1484609..408ffbe 100644
--- a/drivers/watchdog/sp805_wdt.c
+++ b/drivers/watchdog/sp805_wdt.c
@@ -42,6 +42,7 @@
  /* control register masks */
  #define    INT_ENABLE    (1 << 0)
  #define    RESET_ENABLE    (1 << 1)
+    #define    ENABLE_MASK    (INT_ENABLE | RESET_ENABLE)
  #define WDTINTCLR    0x00C
  #define WDTRIS    0x010
  #define WDTMIS    0x014
@@ -74,6 +75,18 @@ module_param(nowayout, bool, 0);
  MODULE_PARM_DESC(nowayout,
  "Set to 1 to keep watchdog running after device release");
  +/* returns true if wdt is running; otherwise returns false */
+static bool wdt_is_running(struct watchdog_device *wdd)
+{
+    struct sp805_wdt *wdt = watchdog_get_drvdata(wdd);
+
+    if ((readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK) ==
+    ENABLE_MASK)
+    return true;
+    else
+    return false;


return !!(readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK);



Note ENABLE_MASK contains two bits (INT_ENABLE and RESET_ENABLE); 
therefore, a simple !!(expression) would not work? That is, the 
masked result needs to be compared against the mask again to ensure 
both bits are set, right?
Ray - your original code looks correct to me.  Easier to read and 
less prone to errors as shown in the attempted translation to a 
single statement.


 if ()
 return true;
 else
 return false;

still looks really dumb, though, and IMO is actually harder to read 
than just "return ;" because it forces you to stop 
and double-check that the logic is, in fact, only doing the obvious 
thing.


If you can propose a way to modify my original code above to make it 
more readable, I'm fine to make the change.


As I mentioned, I don't think the following change proposed by Guenter 
will work due to the reason I pointed out:


return !!(readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK);

What they are looking for is:
return ((readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK) == 
ENABLE_MASK);


or maybe:
return !!((readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK) == 
ENABLE_MASK);




Robin.



p.s. No thanks for making me remember the mind-boggling horror of 
briefly maintaining part of this legacy codebase... :P


$ grep -r '? true : false' --include=*.cpp . | wc -l
951






Re: [PATCH 3/5] watchdog: sp805: set WDOG_HW_RUNNING when appropriate

2018-05-23 Thread Robin Murphy

On 23/05/18 17:29, Ray Jui wrote:

Hi Robin,

On 5/23/2018 4:48 AM, Robin Murphy wrote:

On 23/05/18 08:52, Scott Branden wrote:



On 18-05-22 04:24 PM, Ray Jui wrote:

Hi Guenter,

On 5/22/2018 1:54 PM, Guenter Roeck wrote:

On Tue, May 22, 2018 at 11:47:18AM -0700, Ray Jui wrote:

If the watchdog hardware is already enabled during the boot process,
when the Linux watchdog driver loads, it should reset the watchdog 
and

tell the watchdog framework. As a result, ping can be generated from
the watchdog framework, until the userspace watchdog daemon takes 
over

control

Signed-off-by: Ray Jui 
Reviewed-by: Vladimir Olovyannikov 


Reviewed-by: Scott Branden 
---
  drivers/watchdog/sp805_wdt.c | 22 ++
  1 file changed, 22 insertions(+)

diff --git a/drivers/watchdog/sp805_wdt.c 
b/drivers/watchdog/sp805_wdt.c

index 1484609..408ffbe 100644
--- a/drivers/watchdog/sp805_wdt.c
+++ b/drivers/watchdog/sp805_wdt.c
@@ -42,6 +42,7 @@
  /* control register masks */
  #define    INT_ENABLE    (1 << 0)
  #define    RESET_ENABLE    (1 << 1)
+    #define    ENABLE_MASK    (INT_ENABLE | RESET_ENABLE)
  #define WDTINTCLR    0x00C
  #define WDTRIS    0x010
  #define WDTMIS    0x014
@@ -74,6 +75,18 @@ module_param(nowayout, bool, 0);
  MODULE_PARM_DESC(nowayout,
  "Set to 1 to keep watchdog running after device release");
  +/* returns true if wdt is running; otherwise returns false */
+static bool wdt_is_running(struct watchdog_device *wdd)
+{
+    struct sp805_wdt *wdt = watchdog_get_drvdata(wdd);
+
+    if ((readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK) ==
+    ENABLE_MASK)
+    return true;
+    else
+    return false;


return !!(readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK);



Note ENABLE_MASK contains two bits (INT_ENABLE and RESET_ENABLE); 
therefore, a simple !!(expression) would not work? That is, the 
masked result needs to be compared against the mask again to ensure 
both bits are set, right?
Ray - your original code looks correct to me.  Easier to read and 
less prone to errors as shown in the attempted translation to a 
single statement.


 if ()
 return true;
 else
 return false;

still looks really dumb, though, and IMO is actually harder to read 
than just "return ;" because it forces you to stop 
and double-check that the logic is, in fact, only doing the obvious 
thing.


If you can propose a way to modify my original code above to make it 
more readable, I'm fine to make the change.


Well,

return (readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK) ==
ENABLE_MASK;

would probably be reasonable to anyone other than the 80-column zealots, 
but removing the silly boolean-to-boolean translation idiom really only 
emphasises the fact that it's fundamentally a big complex statement; for 
maximum clarity I'd be inclined to separate the two logical operations 
(read and comparison), e.g.:


u32 wdtcontrol = readl_relaxed(wdt->base + WDTCONTROL);

return (wdtcontrol & ENABLE_MASK) == ENABLE_MASK;

which is still -3 lines vs. the original.

As I mentioned, I don't think the following change proposed by Guenter 
will work due to the reason I pointed out:


return !!(readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK));


FWIW, getting the desired result should only need one logical not 
swapping for a bitwise one there:


return !(~readl_relaxed(wdt->base + WDTCONTROL) & ENABLE_MASK);

but that's well into "too clever for its own good" territory ;)

Robin.



[RFC PATCH v3 9/9] arch/x86/include/asm/page.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE no longer equals GFP_HIGHUSER | __GFP_MOVABLE;
modify this macro to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: x...@kernel.org 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 arch/x86/include/asm/page.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48..a47f42d 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -35,7 +35,8 @@ static inline void copy_user_page(void *to, void *from, 
unsigned long vaddr,
 }
 
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+   alloc_page_vma((movableflags ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER) \
+   | __GFP_ZERO, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #ifndef __pa
-- 
1.8.3.1




Re: [PATCH 1/4] rcu: Speed up calling of RCU tasks callbacks

2018-05-23 Thread Paul E. McKenney
On Wed, May 23, 2018 at 12:45:31PM -0400, Steven Rostedt wrote:
> On Wed, 23 May 2018 08:57:34 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Tue, May 22, 2018 at 11:38:12PM -0700, Joel Fernandes wrote:
> > > From: "Joel Fernandes (Google)" 
> > > 
> > > RCU tasks callbacks can take at least 1 second before the callbacks are
> > > executed. This happens even if the hold-out tasks enter their quiescent 
> > > states
> > > quickly. I noticed this when I was testing trampoline callback execution.
> > > 
> > > To test the trampoline freeing, I wrote a simple script:
> > > cd /sys/kernel/debug/tracing/
> > > echo '__schedule_bug:traceon' > set_ftrace_filter;
> > > echo '!__schedule_bug:traceon' > set_ftrace_filter;
> > > 
> > > In the background I had simple bash while loop:
> > > while [ 1 ]; do x=1; done &
> > > 
> > > Total time of completion of above commands in seconds:
> > > 
> > > With this patch:
> > > real0m0.179s
> > > user0m0.000s
> > > sys 0m0.054s
> > > 
> > > Without this patch:
> > > real0m1.098s
> > > user0m0.000s
> > > sys 0m0.053s
> > > 
> > > That's a greater than 6X speed up in performance. In order to accomplish
> > > this, I am waiting for HZ/10 time before entering the hold-out checking
> > > loop. The loop still preserves its checking of held tasks every 1 second
> > > as before, in case this first test doesn't succeed.
> > > 
> > > Cc: Steven Rostedt   
> > 
> > Given an ack from Steven, I would be happy to take this, give or take
> > some nits below.
> 
> I'm currently testing it, and trying to understand it better.

Very good, thank you!

> > > Cc: Peter Zilstra 
> > > Cc: Ingo Molnar 
> > > Cc: Boqun Feng 
> > > Cc: Paul McKenney 
> > > Cc: byungchul.p...@lge.com
> > > Cc: kernel-t...@android.com
> > > Signed-off-by: Joel Fernandes (Google) 
> > > ---
> > >  kernel/rcu/update.c | 12 +++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 5783bdf86e5a..a28698e44b08 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -743,6 +743,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > >*/
> > >   synchronize_srcu(_rcu_exit_srcu);
> > > 
> > > + /*
> > > +  * Wait a little bit incase held tasks are released  
> > 
> > in case
> > 
> > > +  * during their next timer ticks.
> > > +  */
> > > + schedule_timeout_interruptible(HZ/10);
> > > +
> > >   /*
> > >* Each pass through the following loop scans the list
> > >* of holdout tasks, removing any that are no longer
> > > @@ -755,7 +761,6 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > >   int rtst;
> > >   struct task_struct *t1;
> > > 
> > > - schedule_timeout_interruptible(HZ);
> > >   rtst = READ_ONCE(rcu_task_stall_timeout);
> > >   needreport = rtst > 0 &&
> > >time_after(jiffies, lastreport + rtst);
> > > @@ -768,6 +773,11 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > >   check_holdout_task(t, needreport, );
> > >   cond_resched();
> > >   }
> > > +
> > > + if (list_empty(_tasks_holdouts))
> > > + break;
> > > +
> > > + schedule_timeout_interruptible(HZ);  
> 
> Why is this a full second wait and not the HZ/10 like the others?

The idea is to respond quickly on small idle systems and to reduce the
number of possibly quite lengthy traversals of the task list otherwise.
I actually considered exponential backoff, but decided to keep it simple,
at least to start with.

Thanx, Paul

> -- Steve
> 
> > 
> > Is there a better way to do this?  Can this be converted into a for-loop?
> > Alternatively, would it make sense to have a firsttime local variable
> > initialized to true, to keep the schedule_timeout_interruptible() at
> > the beginning of the loop, but skip it on the first pass through the loop?
> > 
> > Don't get me wrong, what you have looks functionally correct, but
> > duplicating the condition might cause problems later on, for example,
> > should a bug fix be needed in the condition.
> > 
> > >   }
> > > 
> > >   /*
> > > -- 
> > > 2.17.0.441.gb46fe60e1d-goog
> > >   
> 



Re: [PATCH 01/10] mm: pagecache add lock

2018-05-23 Thread Kent Overstreet
On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > > 
> > > Honestly I think this probably should be in the core.  But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> > 
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
> 
> It isn't crazy at all.  In general we expect people to do their fair
> share of core work to get their pet feature in.  How much is acceptable
> is a difficult question and not black and white.
> 
> But if you want to grow a critical core structure you better take a stab
> at converting existing users.  Without that the tradeoff of growing
> that core structure is simply not given.
> 
> Or to put it in words for this exact feature:  unless your new field
> is also used by mainstream file systems it is not going to be added.

Christoph, I'm really not someone you can accuse of avoiding my share of core
work and refactoring and you know it.

When are you going to get around to converting existing users of fs/direct-io.c
to iomap so it can be deleted? The kernel is carrying around two dio
implementations right now thanks to you. Not a good situation, is it?


[RFC PATCH v3 8/9] include/linux/highmem.h: update usage of movableflags

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

GFP_HIGHUSER_MOVABLE is not equal to GFP_HIGHUSER | __GFP_MOVABLE; update
this usage to adapt to the patch getting rid of GFP_ZONE_TABLE/BAD.

Signed-off-by: Huaisheng Ye 
Cc: Kate Stewart 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Philippe Ombredanne 
Cc: Christoph Hellwig 
---
 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0690679..5383c9e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -159,8 +159,8 @@ static inline void clear_user_highpage(struct page *page, 
unsigned long vaddr)
struct vm_area_struct *vma,
unsigned long vaddr)
 {
-   struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
-   vma, vaddr);
+   struct page *page = alloc_page_vma(movableflags ?
+   GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER, vma, vaddr);
 
if (page)
clear_user_highpage(page, vaddr);
-- 
1.8.3.1



[RFC PATCH v3 7/9] mm/zsmalloc: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MOVABLE to replace (__GFP_HIGHMEM | __GFP_MOVABLE).

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the GFP
bitmasks; the bottom three bits of the GFP mask are reserved for storing an
encoded zone number.

__GFP_ZONE_MOVABLE contains encoded ZONE_MOVABLE and __GFP_MOVABLE flag.

With GFP_ZONE_TABLE, __GFP_HIGHMEM ORed with __GFP_MOVABLE means gfp_zone
should return ZONE_MOVABLE. In order to keep that compatible with
GFP_ZONE_TABLE, use GFP_NORMAL_UNMOVABLE() to clear the bottom 4 bits of the
GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Christoph Hellwig 
---
 mm/zsmalloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 61cb05d..e250c69 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -345,7 +345,7 @@ static void destroy_cache(struct zs_pool *pool)
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(gfp));
 }
 
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -356,7 +356,7 @@ static void cache_free_handle(struct zs_pool *pool, 
unsigned long handle)
 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
 {
return kmem_cache_alloc(pool->zspage_cachep,
-   flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   GFP_NORMAL_UNMOVABLE(flags));
 }
 
 static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
-- 
1.8.3.1



Re: Suggestion: „spectre_v2=off“ and „nopti“ per default in "Intel Atom N270" case?

2018-05-23 Thread Johannes Hirte
On 2018 Mai 23, Pavel Machek wrote:
> On Sat 2018-05-19 21:53:02, Christian Krüger wrote:
> > Hi,
> > 
> > Since the old "in-order-execution" Intel CPUs like the Intel Atom N270
> > (known for being installed in many Netbooks and Nettops) are not sensitive
> > for "Meltdown" & "Spectre" , wouldn't it be a good idea to exclude these
> > anyway "weak" CPUs from the costly patches by default?
> > 
> > Browsing the web, I can "feel the difference" if the matching kernel options
> > are applied on such a device.
> 
> Can you also measure the difference? Placebo effect is hard to avoid.
> 
> But yes, we do not need to do workarounds on non-buggy machines...
> 
>   Pavel

On my Atom N270 there doesn't seem to be any workaround active with
kernel 4.14.42:

localhost ~ # cat /sys/devices/system/cpu/vulnerabilities/*
Not affected
Not affected
Not affected

Christian, did you verify that the mitigations are active on your system?
Which kernels are affected?

-- 
Regards,
  Johannes



Re: [PATCH v2 0/4] ARM64: dts: meson-axg: i2c updates

2018-05-23 Thread Kevin Hilman
Jerome Brunet  writes:

> This patchset fixes a few problems found in the i2c nodes of
> amlogic's meson-axg platform. It also adds the missing pins for
> AO controller so we can use it on the S400
>
> This series has been tested on the S400 board.
>
> Jerome Brunet (4):
>   ARM64: dts: meson-axg: clean-up i2c nodes
>   ARM64: dts: meson-axg: correct i2c AO clock
>   ARM64: dts: meson-axg: add i2c AO pins
>   ARM64: dts: meson-axg: enable i2c AO on the S400 board

Applied to v4.18/dt64,

Kevin


Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()

2018-05-23 Thread Ross Zwisler
On Wed, May 23, 2018 at 09:53:38AM -0700, Dan Williams wrote:
> On Wed, May 23, 2018 at 9:47 AM, Ross Zwisler
>  wrote:
> > On Wed, May 23, 2018 at 09:39:04AM -0700, Dan Williams wrote:
> >> On Wed, May 23, 2018 at 9:34 AM, Ross Zwisler
> >>  wrote:
> >> > On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote:
> >> >> In preparation for protecting the dax read(2) path from media errors
> >> >> with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the
> >> >> implementation to report the bytes successfully transferred.
> >> >>
> >> >> Cc: 
> >> >> Cc: Ingo Molnar 
> >> >> Cc: Borislav Petkov 
> >> >> Cc: Tony Luck 
> >> >> Cc: Al Viro 
> >> >> Cc: Thomas Gleixner 
> >> >> Cc: Andy Lutomirski 
> >> >> Cc: Peter Zijlstra 
> >> >> Cc: Andrew Morton 
> >> >> Cc: Linus Torvalds 
> >> >> Signed-off-by: Dan Williams 
> >> >> ---
> >> >>  fs/dax.c |   20 +++-
> >> >>  1 file changed, 11 insertions(+), 9 deletions(-)
> >> >>
> >> >> diff --git a/fs/dax.c b/fs/dax.c
> >> >> index a64afdf7ec0d..34a2d435ae4b 100644
> >> >> --- a/fs/dax.c
> >> >> +++ b/fs/dax.c
> >> >> @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> >> >> loff_t length, void *data,
> >> >>   struct iov_iter *iter = data;
> >> >>   loff_t end = pos + length, done = 0;
> >> >>   ssize_t ret = 0;
> >> >> + size_t xfer;
> >> >>   int id;
> >> >>
> >> >>   if (iov_iter_rw(iter) == READ) {
> >> >> @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t 
> >> >> pos, loff_t length, void *data,
> >> >>* vfs_write(), depending on which operation we are doing.
> >> >>*/
> >> >>   if (iov_iter_rw(iter) == WRITE)
> >> >> - map_len = dax_copy_from_iter(dax_dev, pgoff, 
> >> >> kaddr,
> >> >> + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> >> >>   map_len, iter);
> >> >>   else
> >> >> - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr,
> >> >> + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
> >> >>   map_len, iter);
> >> >> - if (map_len <= 0) {
> >> >> - ret = map_len ? map_len : -EFAULT;
> >> >> - break;
> >> >> - }
> >> >>
> >> >> - pos += map_len;
> >> >> - length -= map_len;
> >> >> - done += map_len;
> >> >> + pos += xfer;
> >> >> + length -= xfer;
> >> >> + done += xfer;
> >> >> +
> >> >> + if (xfer == 0)
> >> >> + ret = -EFAULT;
> >> >> + if (xfer < map_len)
> >> >> + break;
> >> >
> >> > I'm confused by this error handling.  So if we hit an error on a given 
> >> > iov and
> >> > we don't transfer the expected number of bytes, we have two cases:
> >> >
> >> > 1) We transferred *something* on this iov but not everything - return 
> >> > success.
> >> > 2) We didn't transfer anything on this iov - return -EFAULT.
> >> >
> >> > Both of these are true regardless of data transferred on previous iovs.
> >> >
> >> > Why the distinction?  If a given iov is interrupted, regardless of 
> >> > whether it
> >> > transferred 0 bytes or 1, shouldn't the error path be the same?
> >>
> >> This is is the semantics of read(2) / write(2). Quoting the pwrite man 
> >> page:
> >>
> >>Note that is not an error for  a  successful  call  to
> >> transfer  fewer  bytes  than
> >>requested (see read(2) and write(2)).
> >
> > Consider this case:
> >
> > I have 4 IOVs, each of a full page.  The first three transfer their full 
> > page,
> > but on the third we hit an error.
> >
> > If we transferred 0 bytes in the fourth page, we'll return -EFAULT.
> >
> > If we transferred 1 byte in the fourth page, we'll return the total length
> > transferred, so 3 pages + 1 byte.
> >
> > Why?  pwrite(2) says it returns the number of bytes written, which can be 
> > less
> > than the total requested.  Why not just return the length transferred in 
> > both
> > cases, instead of returning -EFAULT for one of them?
> 
> Ah, now I see. Yes, that's a bug. Once we have successfully completed
> any iovec we should be returning bytes transferred not -EFAULT.

Actually, your code is fine.  This is handled by the:

 return done ? done : ret;

at the end of the function.  So if we've transferred any data at all, we'll
return the number of bytes transferred, and if we didn't we'll return -EFAULT
because 0 is the special case which means EOF according to pread(2)/pwrite(2).

Looks good, then.  Thanks for answering my questions. 

Reviewed-by: Ross Zwisler 


Re: [PATCH] clk: aspeed: Add 24MHz fixed clock

2018-05-23 Thread Rob Herring
On Fri, May 18, 2018 at 04:57:02PM +0800, Lei YU wrote:
> Add a 24MHz fixed clock.
> This clock will be used for certain devices, e.g. pwm.
> 
> Signed-off-by: Lei YU 
> ---
>  drivers/clk/clk-aspeed.c | 9 -
>  include/dt-bindings/clock/aspeed-clock.h | 1 +
>  2 files changed, 9 insertions(+), 1 deletion(-)

Acked-by: Rob Herring 


Re: [PATCH 2/2] PM / devfreq: Generic cpufreq governor

2018-05-23 Thread Rob Herring
On Fri, May 18, 2018 at 01:24:48AM -0700, Saravana Kannan wrote:
> This devfreq governor is a generic implementation that can scale any
> devfreq device based on the current CPU frequency of all ONLINE CPUs. It
> allows for specifying CPU freq to devfreq mapping for specific devices.
> When such a mapping is not present, it defaults to scaling the device
> frequency in proportion to the CPU frequency.
> 
> Change-Id: I7f786b9059435afe85b9ec8c504a4655731ee20e

drop this.

> Signed-off-by: Saravana Kannan 
> ---
>  .../bindings/devfreq/devfreq-cpufreq.txt   |  53 ++

Please split bindings to separate patch.

>  drivers/devfreq/Kconfig|   8 +
>  drivers/devfreq/Makefile   |   1 +
>  drivers/devfreq/governor_cpufreq.c | 628 
> +
>  4 files changed, 690 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
>  create mode 100644 drivers/devfreq/governor_cpufreq.c
> 
> diff --git a/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt 
> b/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
> new file mode 100644
> index 000..6537538
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/devfreq/devfreq-cpufreq.txt
> @@ -0,0 +1,53 @@
> +Devfreq CPUfreq governor
> +
> +devfreq-cpufreq is a parent device that contains one or more child devices.
> +Each child device provides CPU frequency to device frequency mapping for a
> +specific device. Examples of devices that could use this are: DDR, cache and
> +CCI.
> +
> +Parent device name shall be "devfreq-cpufreq".

I don't really understand any of this and how it relates to the other 
QCom cpufreq and devfreq bindings. Seems like this all needs some 
discussion amongst the PM folks first.

Rob


Re: [PATCH net] net: phy: broadcom: Fix bcm_write_exp()

2018-05-23 Thread Florian Fainelli
On 05/22/2018 06:20 PM, Florian Fainelli wrote:
> Hi Andrew,
> 
> On 05/22/2018 05:15 PM, Andrew Lunn wrote:
>> On Tue, May 22, 2018 at 05:04:49PM -0700, Florian Fainelli wrote:
>>> On newer PHYs, we need to select the expansion register to write with
>>> setting bits [11:8] to 0xf. This was done correctly by bcm7xxx.c prior
>>> to being migrated to generic code under bcm-phy-lib.c which
>>> unfortunately used the older implementation from the BCM54xx days.
>>
>> Hi Florian
>>
>> Does selecting the expansion register affect access to the standard
>> registers? Does this need locking like the Marvell PHY has when
>> changing pages?
> 
> We should probably convert this to the page accessors since the
> expansion, misc and other shadow 0x1c accesses are all indirection
> layers to poke into a different address space of the PHY. That would be
> a separate fix though for a number of reasons.

I realize I did not quite answer your question. AFAICT the answer is
no: starting the expansion register write sequence and then aborting
mid-way is not a problem and does not impact the standard MII
registers, because of how this is implemented. The registers are
accessed and latched through a specific indirect sequence, but there
is no page switching, unlike the Marvell PHYs.
-- 
Florian
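
The select value Florian's fix writes can be sketched as follows: on
newer PHYs the expansion-register select placed into register 0x17
must carry 0xf in bits [11:8] on top of the 8-bit expansion register
number. The constant name follows the kernel's brcmphy.h, but treat
the helper itself as illustrative, not the bcm-phy-lib.c code:

```c
#include <assert.h>

#define MII_BCM54XX_EXP_SEL_ER	0x0f00	/* expansion register select */

/* Form the value written to the select register (0x17) before the
 * expansion register data access: 0xf in bits [11:8], register number
 * in the low byte. */
static unsigned int bcm_exp_select(unsigned int regnum)
{
	return MII_BCM54XX_EXP_SEL_ER | (regnum & 0xff);
}
```

The older BCM54xx-era code omitted the 0xf select bits, which is what
the patch corrects.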


Re: [PATCH v7 0/7] KVM: x86: Allow Qemu/KVM to use PVH entry point

2018-05-23 Thread Maran Wilson

On 5/18/2018 4:31 AM, Paolo Bonzini wrote:

On 16/05/2018 22:27, Maran Wilson wrote:

Friendly ping. I am hopeful one of the x86 and/or KVM maintainers has a
few cycles to spare to look this over.

And thanks to everyone who has helped thus far by providing valuable
feedback and reviewing.

    https://lkml.org/lkml/2018/4/16/1002

KVM bits look fine.  This would be the right time to post the QEMU
patches...


Thanks Paolo. Yes, we will have the Qemu patches out soon. It is being
actively worked on. We had one implementation in place, but decided to
re-implement things slightly based on some preliminary feedback before
sending it out to the wider community.


Thanks,
-Maran


Paolo




[RFC PATCH v3 4/9] fs/btrfs/extent_io: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA32 | __GFP_HIGHMEM).

In function alloc_extent_state, it is obvious that __GFP_DMA is not
the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are now reserved
for storing the encoded zone number, so __GFP_DMA, __GFP_HIGHMEM and
__GFP_DMA32 must no longer be ORed with each other.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e99b329..f41fc61 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -220,7 +220,7 @@ static struct extent_state *alloc_extent_state(gfp_t mask)
 * The given mask might be not appropriate for the slab allocator,
 * drop the unsupported bits
 */
-   mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
+   mask = GFP_NORMAL(mask);
state = kmem_cache_alloc(extent_state_cache, mask);
if (!state)
return state;
-- 
1.8.3.1
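
The GFP_NORMAL() helper this series relies on can be sketched as
below: since the patches encode the zone in the bottom three bits of
the GFP mask, "normal zone" means clearing those bits. The mask value
and type here are illustrative stand-ins, not the kernel's actual
definitions:

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Hypothetical encoding: bottom 3 bits of the mask hold the zone number. */
#define __GFP_ZONE_MASK ((gfp_t)0x07)
#define GFP_NORMAL(gfp) ((gfp) & ~__GFP_ZONE_MASK)

/* Drop any encoded zone from a mask, keeping the remaining flag bits. */
static inline gfp_t gfp_normal(gfp_t mask)
{
	return GFP_NORMAL(mask);
}
```

This replaces open-coded masks like `~(__GFP_DMA32|__GFP_HIGHMEM)`
with a single operation that cannot miss a zone modifier.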



Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()

2018-05-23 Thread Dan Williams
On Wed, May 23, 2018 at 9:47 AM, Ross Zwisler
 wrote:
> On Wed, May 23, 2018 at 09:39:04AM -0700, Dan Williams wrote:
>> On Wed, May 23, 2018 at 9:34 AM, Ross Zwisler
>>  wrote:
>> > On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote:
>> >> In preparation for protecting the dax read(2) path from media errors
>> >> with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the
>> >> implementation to report the bytes successfully transferred.
>> >>
>> >> Cc: 
>> >> Cc: Ingo Molnar 
>> >> Cc: Borislav Petkov 
>> >> Cc: Tony Luck 
>> >> Cc: Al Viro 
>> >> Cc: Thomas Gleixner 
>> >> Cc: Andy Lutomirski 
>> >> Cc: Peter Zijlstra 
>> >> Cc: Andrew Morton 
>> >> Cc: Linus Torvalds 
>> >> Signed-off-by: Dan Williams 
>> >> ---
>> >>  fs/dax.c |   20 +++-
>> >>  1 file changed, 11 insertions(+), 9 deletions(-)
>> >>
>> >> diff --git a/fs/dax.c b/fs/dax.c
>> >> index a64afdf7ec0d..34a2d435ae4b 100644
>> >> --- a/fs/dax.c
>> >> +++ b/fs/dax.c
>> >> @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
>> >> loff_t length, void *data,
>> >>   struct iov_iter *iter = data;
>> >>   loff_t end = pos + length, done = 0;
>> >>   ssize_t ret = 0;
>> >> + size_t xfer;
>> >>   int id;
>> >>
>> >>   if (iov_iter_rw(iter) == READ) {
>> >> @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
>> >> loff_t length, void *data,
>> >>* vfs_write(), depending on which operation we are doing.
>> >>*/
>> >>   if (iov_iter_rw(iter) == WRITE)
>> >> - map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
>> >> + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
>> >>   map_len, iter);
>> >>   else
>> >> - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr,
>> >> + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
>> >>   map_len, iter);
>> >> - if (map_len <= 0) {
>> >> - ret = map_len ? map_len : -EFAULT;
>> >> - break;
>> >> - }
>> >>
>> >> - pos += map_len;
>> >> - length -= map_len;
>> >> - done += map_len;
>> >> + pos += xfer;
>> >> + length -= xfer;
>> >> + done += xfer;
>> >> +
>> >> + if (xfer == 0)
>> >> + ret = -EFAULT;
>> >> + if (xfer < map_len)
>> >> + break;
>> >
>> > I'm confused by this error handling.  So if we hit an error on a given iov 
>> > and
>> > we don't transfer the expected number of bytes, we have two cases:
>> >
>> > 1) We transferred *something* on this iov but not everything - return 
>> > success.
>> > 2) We didn't transfer anything on this iov - return -EFAULT.
>> >
>> > Both of these are true regardless of data transferred on previous iovs.
>> >
>> > Why the distinction?  If a given iov is interrupted, regardless of whether 
>> > it
>> > transferred 0 bytes or 1, shouldn't the error path be the same?
>>
>> This is the semantics of read(2) / write(2). Quoting the pwrite man page:
>>
>>    Note that it is not an error for a successful call to transfer
>>    fewer bytes than requested (see read(2) and write(2)).
>
> Consider this case:
>
> I have 4 IOVs, each of a full page.  The first three transfer their full page,
> but on the fourth we hit an error.
>
> If we transferred 0 bytes in the fourth page, we'll return -EFAULT.
>
> If we transferred 1 byte in the fourth page, we'll return the total length
> transferred, so 3 pages + 1 byte.
>
> Why?  pwrite(2) says it returns the number of bytes written, which can be less
> than the total requested.  Why not just return the length transferred in both
> cases, instead of returning -EFAULT for one of them?

Ah, now I see. Yes, that's a bug. Once we have successfully completed
any iovec we should be returning bytes transferred not -EFAULT.
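
The short-transfer rule agreed on above can be modeled in a few lines:
once any bytes have been copied, the call reports the running byte
count as a short (successful) transfer, and only zero overall progress
reports -EFAULT. This is a sketch of the semantics, not the
dax_iomap_actor() code itself:

```c
#include <assert.h>
#include <stddef.h>

#define EFAULT 14

/* done  = bytes completed on earlier iovecs,
 * xfer  = bytes copied on the iovec that faulted. */
static long iter_result(size_t done, size_t xfer)
{
	if (done == 0 && xfer == 0)
		return -EFAULT;		/* no bytes at all were transferred */
	return (long)(done + xfer);	/* partial progress is a short write */
}
```

Under this rule Ross's "3 pages + 1 byte" and "3 pages + 0 bytes"
cases both return the bytes transferred, as pwrite(2) permits.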


[RFC PATCH v3 3/9] drivers/xen/swiotlb-xen: update usage of zone modifiers

2018-05-23 Thread Huaisheng Ye
From: Huaisheng Ye 

Use __GFP_ZONE_MASK to replace (__GFP_DMA | __GFP_HIGHMEM).

In function xen_swiotlb_alloc_coherent, it is obvious that __GFP_DMA32
is not the expected zone type.

___GFP_DMA, ___GFP_HIGHMEM and ___GFP_DMA32 have been deleted from the
GFP bitmasks; the bottom three bits of the GFP mask are now reserved
for storing the encoded zone number, so __GFP_DMA, __GFP_HIGHMEM and
__GFP_DMA32 must no longer be ORed with each other.

Use GFP_NORMAL() to clear the bottom 3 bits of the GFP bitmask.

Signed-off-by: Huaisheng Ye 
Cc: Konrad Rzeszutek Wilk 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index e1c6089..359 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -301,7 +301,7 @@ int __ref xen_swiotlb_init(int verbose, bool early)
* machine physical layout.  We can't allocate highmem
* because we can't return a pointer to it.
*/
-   flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
+   flags = GFP_NORMAL(flags);
 
/* On ARM this function returns an ioremap'ped virtual address for
 * which virt_to_phys doesn't return the corresponding physical
-- 
1.8.3.1



Re: [PATCH 2/4] arcnet: com20020: bindings for smsc com20020

2018-05-23 Thread Rob Herring
On Thu, May 17, 2018 at 03:06:26PM +0200, Andrea Greco wrote:
> From: Andrea Greco 
> 
> Add devicetree bindings for smsc com20020
> 
> Signed-off-by: Andrea Greco 
> ---
>  .../devicetree/bindings/net/smsc-com20020.txt   | 21 
> +
>  1 file changed, 21 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt

One typo, otherwise:

Reviewed-by: Rob Herring 

> 
> diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt 
> b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> new file mode 100644
> index ..92360b054873
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> @@ -0,0 +1,21 @@
> +SMSC com20020 Arcnet network controller
> +
> +Required propelty:

property

> +- timeout-ns: Arcnet bus timeout, Idle Time (328000 - 20500)
> +- bus-speed-bps: Arcnet bus speed (1000 - 156250)
> +- smsc,xtal-mhz: External oscillator frequency
> +- smsc,backplane-enabled: Controller use backplane mode
> +- reset-gpios: Chip reset pin
> +- interrupts: Should contain controller interrupt
> +
> +arcnet@2800 {
> +compatible = "smsc,com20020";
> +
> + timeout-ns = <20500>;
> + bus-speed-bps = <1000>;
> + smsc,xtal-mhz = <20>;
> + smsc,backplane-enabled;
> +
> + reset-gpios = < 21 GPIO_ACTIVE_LOW>;
> + interrupts = < 10 GPIO_ACTIVE_LOW>;
> +};
> -- 
> 2.14.3
> 



Re: [PATCH v3 7/9] dax: report bytes remaining in dax_iomap_actor()

2018-05-23 Thread Ross Zwisler
On Wed, May 23, 2018 at 09:39:04AM -0700, Dan Williams wrote:
> On Wed, May 23, 2018 at 9:34 AM, Ross Zwisler
>  wrote:
> > On Thu, May 03, 2018 at 05:06:42PM -0700, Dan Williams wrote:
> >> In preparation for protecting the dax read(2) path from media errors
> >> with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the
> >> implementation to report the bytes successfully transferred.
> >>
> >> Cc: 
> >> Cc: Ingo Molnar 
> >> Cc: Borislav Petkov 
> >> Cc: Tony Luck 
> >> Cc: Al Viro 
> >> Cc: Thomas Gleixner 
> >> Cc: Andy Lutomirski 
> >> Cc: Peter Zijlstra 
> >> Cc: Andrew Morton 
> >> Cc: Linus Torvalds 
> >> Signed-off-by: Dan Williams 
> >> ---
> >>  fs/dax.c |   20 +++-
> >>  1 file changed, 11 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index a64afdf7ec0d..34a2d435ae4b 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -991,6 +991,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> >> loff_t length, void *data,
> >>   struct iov_iter *iter = data;
> >>   loff_t end = pos + length, done = 0;
> >>   ssize_t ret = 0;
> >> + size_t xfer;
> >>   int id;
> >>
> >>   if (iov_iter_rw(iter) == READ) {
> >> @@ -1054,19 +1055,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> >> loff_t length, void *data,
> >>* vfs_write(), depending on which operation we are doing.
> >>*/
> >>   if (iov_iter_rw(iter) == WRITE)
> >> - map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> >> + xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> >>   map_len, iter);
> >>   else
> >> - map_len = dax_copy_to_iter(dax_dev, pgoff, kaddr,
> >> + xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
> >>   map_len, iter);
> >> - if (map_len <= 0) {
> >> - ret = map_len ? map_len : -EFAULT;
> >> - break;
> >> - }
> >>
> >> - pos += map_len;
> >> - length -= map_len;
> >> - done += map_len;
> >> + pos += xfer;
> >> + length -= xfer;
> >> + done += xfer;
> >> +
> >> + if (xfer == 0)
> >> + ret = -EFAULT;
> >> + if (xfer < map_len)
> >> + break;
> >
> > I'm confused by this error handling.  So if we hit an error on a given iov 
> > and
> > we don't transfer the expected number of bytes, we have two cases:
> >
> > 1) We transferred *something* on this iov but not everything - return 
> > success.
> > 2) We didn't transfer anything on this iov - return -EFAULT.
> >
> > Both of these are true regardless of data transferred on previous iovs.
> >
> > Why the distinction?  If a given iov is interrupted, regardless of whether 
> > it
> > transferred 0 bytes or 1, shouldn't the error path be the same?
> 
> This is the semantics of read(2) / write(2). Quoting the pwrite man page:
> 
>    Note that it is not an error for a successful call to transfer
>    fewer bytes than requested (see read(2) and write(2)).

Consider this case:

I have 4 IOVs, each of a full page.  The first three transfer their full page,
but on the fourth we hit an error.

If we transferred 0 bytes in the fourth page, we'll return -EFAULT.

If we transferred 1 byte in the fourth page, we'll return the total length
transferred, so 3 pages + 1 byte.

Why?  pwrite(2) says it returns the number of bytes written, which can be less
than the total requested.  Why not just return the length transferred in both
cases, instead of returning -EFAULT for one of them?




Re: [PATCH 1/4] rcu: Speed up calling of RCU tasks callbacks

2018-05-23 Thread Steven Rostedt
On Wed, 23 May 2018 08:57:34 -0700
"Paul E. McKenney"  wrote:

> On Tue, May 22, 2018 at 11:38:12PM -0700, Joel Fernandes wrote:
> > From: "Joel Fernandes (Google)" 
> > 
> > RCU tasks callbacks can take at least 1 second before the callbacks are
> > executed. This happens even if the hold-out tasks enter their quiescent 
> > states
> > quickly. I noticed this when I was testing trampoline callback execution.
> > 
> > To test the trampoline freeing, I wrote a simple script:
> > cd /sys/kernel/debug/tracing/
> > echo '__schedule_bug:traceon' > set_ftrace_filter;
> > echo '!__schedule_bug:traceon' > set_ftrace_filter;
> > 
> > In the background I had simple bash while loop:
> > while [ 1 ]; do x=1; done &
> > 
> > Total time of completion of above commands in seconds:
> > 
> > With this patch:
> > real0m0.179s
> > user0m0.000s
> > sys 0m0.054s
> > 
> > Without this patch:
> > real0m1.098s
> > user0m0.000s
> > sys 0m0.053s
> > 
> > That's a greater than 6X speed up in performance. In order to accomplish
> > this, I am waiting for HZ/10 time before entering the hold-out checking
> > loop. The loop still preserves its checking of held tasks every 1 second
> > as before, in case this first test doesn't succeed.
> > 
> > Cc: Steven Rostedt   
> 
> Given an ack from Steven, I would be happy to take this, give or take
> some nits below.

I'm currently testing it, and trying to understand it better.

> 
>   Thanx, Paul
> 
> > Cc: Peter Zilstra 
> > Cc: Ingo Molnar 
> > Cc: Boqun Feng 
> > Cc: Paul McKenney 
> > Cc: byungchul.p...@lge.com
> > Cc: kernel-t...@android.com
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  kernel/rcu/update.c | 12 +++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 5783bdf86e5a..a28698e44b08 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -743,6 +743,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> >  */
> > synchronize_srcu(_rcu_exit_srcu);
> > 
> > +   /*
> > +* Wait a little bit incase held tasks are released  
> 
>   in case
> 
> > +* during their next timer ticks.
> > +*/
> > +   schedule_timeout_interruptible(HZ/10);
> > +
> > /*
> >  * Each pass through the following loop scans the list
> >  * of holdout tasks, removing any that are no longer
> > @@ -755,7 +761,6 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > int rtst;
> > struct task_struct *t1;
> > 
> > -   schedule_timeout_interruptible(HZ);
> > rtst = READ_ONCE(rcu_task_stall_timeout);
> > needreport = rtst > 0 &&
> >  time_after(jiffies, lastreport + rtst);
> > @@ -768,6 +773,11 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > check_holdout_task(t, needreport, );
> > cond_resched();
> > }
> > +
> > +   if (list_empty(_tasks_holdouts))
> > +   break;
> > +
> > +   schedule_timeout_interruptible(HZ);  

Why is this a full second wait and not the HZ/10 like the others?

-- Steve

> 
> Is there a better way to do this?  Can this be converted into a for-loop?
> Alternatively, would it make sense to have a firsttime local variable
> initialized to true, to keep the schedule_timeout_interruptible() at
> the beginning of the loop, but skip it on the first pass through the loop?
> 
> Don't get me wrong, what you have looks functionally correct, but
> duplicating the condition might cause problems later on, for example,
> should a bug fix be needed in the condition.
> 
> > }
> > 
> > /*
> > -- 
> > 2.17.0.441.gb46fe60e1d-goog
> >   
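
One shape for the wait that avoids the duplicated list_empty() check
Paul is concerned about: a single wait site whose first delay is the
short HZ/10 one and whose later delays are the full HZ, which also
answers Steve's question about why the two timeouts differ (the first
is an optimistic fast path, the rest are the original 1-second poll).
Illustrative only; HZ is passed in rather than taken from the kernel:

```c
#include <assert.h>

/* Jiffies to sleep before scanning the holdout list on a given pass.
 * Pass 0 runs soon after the grace period in case holdouts clear on
 * their next timer tick; later passes poll once per second as before. */
static int holdout_wait_jiffies(int pass, int hz)
{
	return pass == 0 ? hz / 10 : hz;
}
```

With this, the kthread loop keeps one schedule_timeout_interruptible()
call at the top of the loop and no duplicated exit condition.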



Re: [PATCH 1/4] rcu: Speed up calling of RCU tasks callbacks

2018-05-23 Thread Steven Rostedt
On Wed, 23 May 2018 08:57:34 -0700
"Paul E. McKenney"  wrote:

> On Tue, May 22, 2018 at 11:38:12PM -0700, Joel Fernandes wrote:
> > From: "Joel Fernandes (Google)" 
> > 
> > RCU tasks callbacks can take at least 1 second before the callbacks are
> > executed. This happens even if the hold-out tasks enter their quiescent 
> > states
> > quickly. I noticed this when I was testing trampoline callback execution.
> > 
> > To test the trampoline freeing, I wrote a simple script:
> > cd /sys/kernel/debug/tracing/
> > echo '__schedule_bug:traceon' > set_ftrace_filter;
> > echo '!__schedule_bug:traceon' > set_ftrace_filter;
> > 
> > In the background I had a simple bash while loop:
> > while [ 1 ]; do x=1; done &
> > 
> > Total time of completion of above commands in seconds:
> > 
> > With this patch:
> > real	0m0.179s
> > user	0m0.000s
> > sys	0m0.054s
> > 
> > Without this patch:
> > real	0m1.098s
> > user	0m0.000s
> > sys	0m0.053s
> > 
> > That's a greater than 6X speed up in performance. In order to accomplish
> > this, I am waiting for HZ/10 time before entering the hold-out checking
> > loop. The loop still preserves its checking of held tasks every 1 second
> > as before, in case this first test doesn't succeed.
> > 
> > Cc: Steven Rostedt   
> 
> Given an ack from Steven, I would be happy to take this, give or take
> some nits below.

I'm currently testing it, and trying to understand it better.

> 
>   Thanx, Paul
> 
> > Cc: Peter Zijlstra 
> > Cc: Ingo Molnar 
> > Cc: Boqun Feng 
> > Cc: Paul McKenney 
> > Cc: byungchul.p...@lge.com
> > Cc: kernel-t...@android.com
> > Signed-off-by: Joel Fernandes (Google) 
> > ---
> >  kernel/rcu/update.c | 12 +++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 5783bdf86e5a..a28698e44b08 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -743,6 +743,12 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> >  */
> > synchronize_srcu(_rcu_exit_srcu);
> > 
> > +   /*
> > +* Wait a little bit incase held tasks are released  
> 
>   in case
> 
> > +* during their next timer ticks.
> > +*/
> > +   schedule_timeout_interruptible(HZ/10);
> > +
> > /*
> >  * Each pass through the following loop scans the list
> >  * of holdout tasks, removing any that are no longer
> > @@ -755,7 +761,6 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > int rtst;
> > struct task_struct *t1;
> > 
> > -   schedule_timeout_interruptible(HZ);
> > rtst = READ_ONCE(rcu_task_stall_timeout);
> > needreport = rtst > 0 &&
> >  time_after(jiffies, lastreport + rtst);
> > @@ -768,6 +773,11 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> > check_holdout_task(t, needreport, );
> > cond_resched();
> > }
> > +
> > +   if (list_empty(_tasks_holdouts))
> > +   break;
> > +
> > +   schedule_timeout_interruptible(HZ);  

Why is this a full second wait and not the HZ/10 like the others?

-- Steve

> 
> Is there a better way to do this?  Can this be converted into a for-loop?
> Alternatively, would it make sense to have a firsttime local variable
> initialized to true, to keep the schedule_timeout_interruptible() at
> the beginning of the loop, but skip it on the first pass through the loop?
> 
> Don't get me wrong, what you have looks functionally correct, but
> duplicating the condition might cause problems later on, for example,
> should a bug fix be needed in the condition.
> 
> > }
> > 
> > /*
> > -- 
> > 2.17.0.441.gb46fe60e1d-goog
> >   



Re: [PATCH v3] usbip: dynamically allocate idev by nports found in sysfs

2018-05-23 Thread Shuah Khan
On 05/23/2018 03:22 AM, Michael Grzeschik wrote:
> The number of available ports varies with the kernel's build
> configuration. To remove the limitation of the fixed 128 ports,
> we allocate the idevs using the number we get from the kernel.
> 
> Signed-off-by: Michael Grzeschik 
> ---
> v1 -> v2: - reworked memory allocation into one calloc call
>   - added error path on allocation failure
> v2 -> v3: - moved check for available nports to beginning of function
> 

Hmm. With this patch I see a segfault when I run the usbip port command.
I think this patch is incomplete and more changes are needed to the
code that references the idev array.

I can't take this patch.

thanks,
-- Shuah



Re: [PATCH] of: overlay: validate offset from property fixups

2018-05-23 Thread Rob Herring
On Wed, May 16, 2018 at 09:19:51PM -0700, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> The smatch static checker marks the data in offset as untrusted,
> leading it to warn:
> 
>   drivers/of/resolver.c:125 update_usages_of_a_phandle_reference()
>   error: buffer underflow 'prop->value' 's32min-s32max'
> 
> Add a check to verify that the offset is within the property data.
> 
> Reported-by: Dan Carpenter 
> Signed-off-by: Frank Rowand 
> ---
>  drivers/of/resolver.c | 5 +
>  1 file changed, 5 insertions(+)

Applied, thanks.

Rob

