Re: [RFD] Functional dependencies between devices

2015-10-28 Thread Mark Brown
On Wed, Oct 28, 2015 at 04:54:04PM +0100, Rafael J. Wysocki wrote:

> Information that is already available at the device registration time should
> be used at that time or it makes things harder to follow.

> But that really is a tradeoff.  If collecting that information requires too
> much effort, it may not be worth it.

For DT it's going to be a lot easier to reliably collect everything in
driver-specific code; the property names to look at do follow
conventions but are driver-defined.
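For illustration, regulator supplies follow the generic "-supply" convention, but the prefix itself comes from each driver's binding (hypothetical node and supply names, not from the original mail):

```dts
codec@1a {
	/* hypothetical device, for illustration only */
	compatible = "example,codec";
	reg = <0x1a>;

	/* the "-supply" suffix is the generic convention; the AVDD/DBVDD
	 * prefixes are defined by the individual driver's binding */
	AVDD-supply = <&reg_3v3>;
	DBVDD-supply = <&reg_1v8>;
};
```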


signature.asc
Description: PGP signature


Re: [RFD] Functional dependencies between devices

2015-10-28 Thread Mark Brown
On Tue, Oct 27, 2015 at 04:24:14PM +0100, Rafael J. Wysocki wrote:

> So, the question to everybody is whether or not this sounds reasonable or there
> are concerns about it and if so what they are.  At this point I mostly need to
> know if I'm not overlooking anything fundamental at the general level.

This seems like a good plan to me; however, I am concerned that only
allowing links to be created at device registration time will prove
restrictive - it means we're going to ignore anything we figure out
later in the boot sequence.  I would be very surprised if we didn't
need that, either from things that get missed or from things that get
allocated dynamically at runtime on systems with flexible hardware, and
it would also mean that systems can start to benefit from this for
suspend and resume without needing updates to the firmware parsing
support.




Re: [PATCH] MAINTAINERS: Start using the 'reviewer' (R) tag

2015-10-28 Thread Javier Martinez Canillas
Hello Lee,

On Thu, Oct 29, 2015 at 12:56 AM, Krzysztof Kozlowski
 wrote:
> On 28.10.2015 23:38, Lee Jones wrote:
>> On Wed, 28 Oct 2015, Javier Martinez Canillas wrote:
>>> They are not maintainers according to your definition of maintainer
>>> that doesn't seem what most people agree with.
>>
>> "most people" so far are 3 people that I assume still want to be
>> Maintainers despite not actually conducting Maintainer duties, but
>> are rather Reviewers.  I also have 2 Acks for this patch, so thus far
>> that's 3 that agree and 3 that do not.  Unsurprisingly the ones that
>> agree are Maintainers and the ones who are not are (by my definition)
>> Reviewers -- go figure.
>>
>
> I am not sure on which side you put me finally. :)
> If there is a consensus among some more experienced developers that
> maintainer means branch and patches maintaining, then I won't see any
> problem with the patch nor with switching Samsung PMIC entries to review.
>
> In that case, that would be:
> Acked-by: Krzysztof Kozlowski 
>

Same for me; what I don't want is for subsystems to have different
meanings for maintainer and reviewer, since that could confuse
developers posting patches.

But if there is a kernel-wide consensus and all subsystem entries are
going to be updated to use the same semantics, listing as Reviewer
those who don't keep git trees, then I'm OK with this patch and you
can add my:

Reviewed-by: Javier Martinez Canillas 

> Best regards,
> Krzysztof
>

Best regards,
Javier
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3] Input: tsc2005 - Add support for tsc2004

2015-10-28 Thread Michael Welling
Add support for the I2C-based TSC2004.

Due to the overlapping functionality of the TSC2004 and TSC2005,
the common code was moved to a core driver (tsc200x-core).

Signed-off-by: Michael Welling 
---
v3: Splits the tsc2004 and tsc2005 into separate drivers with
common routines in tsc200x-core.
v2: Fixes Kconfig based on a report from the 0-day build bot.
 .../bindings/input/touchscreen/tsc2004.txt |  38 +
 drivers/input/touchscreen/Kconfig  |  17 +
 drivers/input/touchscreen/Makefile |   2 +
 drivers/input/touchscreen/tsc2004.c|  73 ++
 drivers/input/touchscreen/tsc2005.c| 707 +-
 drivers/input/touchscreen/tsc200x-core.c   | 790 +
 drivers/input/touchscreen/tsc200x-core.h   |  13 +
 7 files changed, 938 insertions(+), 702 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/input/touchscreen/tsc2004.txt
 create mode 100644 drivers/input/touchscreen/tsc2004.c
 create mode 100644 drivers/input/touchscreen/tsc200x-core.c
 create mode 100644 drivers/input/touchscreen/tsc200x-core.h

diff --git a/Documentation/devicetree/bindings/input/touchscreen/tsc2004.txt b/Documentation/devicetree/bindings/input/touchscreen/tsc2004.txt
new file mode 100644
index 000..14a37fb
--- /dev/null
+++ b/Documentation/devicetree/bindings/input/touchscreen/tsc2004.txt
@@ -0,0 +1,38 @@
+* Texas Instruments tsc2004 touchscreen controller
+
+Required properties:
+ - compatible: "ti,tsc2004"
+ - interrupts: IRQ specifier
+ - vio-supply : Regulator specifier
+
+Optional properties:
+ - reset-gpios   : GPIO specifier
+ - ti,x-plate-ohms   : integer, resistance of the touchscreen's X plates
+   in ohm (defaults to 280)
+ - ti,esd-recovery-timeout-ms : integer, if the touchscreen does not respond after
+   the configured time (in milliseconds), the driver
+   will reset it. This is disabled by default.
+ - properties defined in touchscreen.txt
+
+Example:
+
+&i2c3 {
+   tsc2004@48 {
+   compatible = "ti,tsc2004";
+   reg = <0x48>;
+   vio-supply = <&vio>;
+
+   reset-gpios = <&gpio4 8 GPIO_ACTIVE_HIGH>;
+   interrupts-extended = <&gpio1 27 IRQ_TYPE_EDGE_RISING>;
+
+   touchscreen-fuzz-x = <4>;
+   touchscreen-fuzz-y = <7>;
+   touchscreen-fuzz-pressure = <2>;
+   touchscreen-size-x = <4096>;
+   touchscreen-size-y = <4096>;
+   touchscreen-max-pressure = <2048>;
+
+   ti,x-plate-ohms = <280>;
+   ti,esd-recovery-timeout-ms = <8000>;
+   };
+}
diff --git a/drivers/input/touchscreen/Kconfig b/drivers/input/touchscreen/Kconfig
index 80cc698..e574f8c 100644
--- a/drivers/input/touchscreen/Kconfig
+++ b/drivers/input/touchscreen/Kconfig
@@ -939,10 +939,27 @@ config TOUCHSCREEN_TSC_SERIO
  To compile this driver as a module, choose M here: the
  module will be called tsc40.
 
+config TOUCHSCREEN_TSC200X
+   tristate
+
+config TOUCHSCREEN_TSC2004
+   tristate "TSC2004 based touchscreens"
+   depends on I2C
+   select REGMAP_I2C
+   select TOUCHSCREEN_TSC200X
+   help
+ Say Y here if you have a TSC2004 based touchscreen.
+
+ If unsure, say N.
+
+ To compile this driver as a module, choose M here: the
+ module will be called tsc2004.
+
 config TOUCHSCREEN_TSC2005
tristate "TSC2005 based touchscreens"
depends on SPI_MASTER
select REGMAP_SPI
+   select TOUCHSCREEN_TSC200X
help
  Say Y here if you have a TSC2005 based touchscreen.
 
diff --git a/drivers/input/touchscreen/Makefile b/drivers/input/touchscreen/Makefile
index 17435c7..810b047 100644
--- a/drivers/input/touchscreen/Makefile
+++ b/drivers/input/touchscreen/Makefile
@@ -69,6 +69,8 @@ obj-$(CONFIG_TOUCHSCREEN_TOUCHIT213)  += touchit213.o
 obj-$(CONFIG_TOUCHSCREEN_TOUCHRIGHT)   += touchright.o
 obj-$(CONFIG_TOUCHSCREEN_TOUCHWIN) += touchwin.o
 obj-$(CONFIG_TOUCHSCREEN_TSC_SERIO)+= tsc40.o
+obj-$(CONFIG_TOUCHSCREEN_TSC200X)  += tsc200x-core.o
+obj-$(CONFIG_TOUCHSCREEN_TSC2004)  += tsc2004.o
 obj-$(CONFIG_TOUCHSCREEN_TSC2005)  += tsc2005.o
 obj-$(CONFIG_TOUCHSCREEN_TSC2007)  += tsc2007.o
 obj-$(CONFIG_TOUCHSCREEN_UCB1400)  += ucb1400_ts.o
diff --git a/drivers/input/touchscreen/tsc2004.c b/drivers/input/touchscreen/tsc2004.c
new file mode 100644
index 000..01457a2
--- /dev/null
+++ b/drivers/input/touchscreen/tsc2004.c
@@ -0,0 +1,73 @@
+/*
+ * TSC2004 touchscreen driver
+ *
+ * Copyright (C) 2015 EMAC Inc.
+ * Copyright (C) 2015 QWERTY Embedded Design
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published 

Re: [PATCH v2 06/10] usb/uvc: Support for V4L2_CTRL_WHICH_DEF_VAL

2015-10-28 Thread Laurent Pinchart
Hi Ricardo,

Thank you for the patch.

On Friday 21 August 2015 15:19:25 Ricardo Ribalda Delgado wrote:
> This driver does not use the control infrastructure.
> Add support for the new field which on structure
>  v4l2_ext_controls
> 
> Signed-off-by: Ricardo Ribalda Delgado 
> ---
>  drivers/media/usb/uvc/uvc_v4l2.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
> index 2764f43607c1..e6d3a1bcfa2f 100644
> --- a/drivers/media/usb/uvc/uvc_v4l2.c
> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> @@ -980,6 +980,7 @@ static int uvc_ioctl_g_ext_ctrls(struct file *file, void *fh,
>   struct uvc_fh *handle = fh;
>   struct uvc_video_chain *chain = handle->chain;
>   struct v4l2_ext_control *ctrl = ctrls->controls;
> + struct v4l2_queryctrl qc;
>   unsigned int i;
>   int ret;
> 
> @@ -988,7 +989,14 @@ static int uvc_ioctl_g_ext_ctrls(struct file *file, void *fh,
>   return ret;
> 
>   for (i = 0; i < ctrls->count; ++ctrl, ++i) {
> - ret = uvc_ctrl_get(chain, ctrl);
> + if (ctrls->which == V4L2_CTRL_WHICH_DEF_VAL) {
> + qc.id = ctrl->id;
> + ret = uvc_query_v4l2_ctrl(chain, &qc);

The uvc_ctrl_begin() call above locks chain->ctrl_mutex, and 
uvc_query_v4l2_ctrl() will then try to acquire the same lock. That's not a 
good idea :-)

I propose moving the ctrls->which check before the uvc_ctrl_begin() call and 
implement it as

if (ctrls->which == V4L2_CTRL_WHICH_DEF_VAL) {
for (i = 0; i < ctrls->count; ++ctrl, ++i) {
struct v4l2_queryctrl qc = { .id = ctrl->id };

ret = uvc_query_v4l2_ctrl(chain, &qc);
if (ret < 0) {
ctrls->error_idx = i;
return ret;
}

ctrl->value = qc.default_value;
}

return 0;
}

> + if (!ret)
> + ctrl->value = qc.default_value;
> + } else
> + ret = uvc_ctrl_get(chain, ctrl);
> +
>   if (ret < 0) {
>   uvc_ctrl_rollback(handle);
>   ctrls->error_idx = i;
> @@ -1010,6 +1018,10 @@ static int uvc_ioctl_s_try_ext_ctrls(struct uvc_fh *handle,
>   unsigned int i;
>   int ret;
> 
> + /* Default value cannot be changed */
> + if (ctrls->which == V4L2_CTRL_WHICH_DEF_VAL)
> + return -EINVAL;
> +
>   ret = uvc_ctrl_begin(chain);
>   if (ret < 0)
>   return ret;

-- 
Regards,

Laurent Pinchart



[PATCH] sparc64: Fix numa distance values

2015-10-28 Thread Nitin Gupta
Orabug: 21896119

Use machine descriptor (MD) to get node latency
values instead of just using default values.

Testing:
On a T5-8 system with:
 - total nodes = 8
 - self latencies = 0x26d18
 - latency to other nodes = 0x3a598
   => latency ratio = ~1.5

output of numactl --hardware

 - before fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

 - after fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  15  15  15  15  15  15  15
  1:  15  10  15  15  15  15  15  15
  2:  15  15  10  15  15  15  15  15
  3:  15  15  15  10  15  15  15  15
  4:  15  15  15  15  10  15  15  15
  5:  15  15  15  15  15  10  15  15
  6:  15  15  15  15  15  15  10  15
  7:  15  15  15  15  15  15  15  10

Signed-off-by: Nitin Gupta 
Reviewed-by: Chris Hyser 
Reviewed-by: Santosh Shilimkar 
---
 arch/sparc/include/asm/topology_64.h |3 +
 arch/sparc/mm/init_64.c  |   70 +-
 2 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 01d1704..ed3dfdd 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
+extern int __node_distance(int, int);
+#define node_distance(a, b) __node_distance(a, b)
+
 #else /* CONFIG_NUMA */
 
 #include 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4ac88b7..3025bd5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -93,6 +93,8 @@ static unsigned long cpu_pgsz_mask;
 static struct linux_prom64_registers pavail[MAX_BANKS];
 static int pavail_ents;
 
+u64 numa_latency[MAX_NUMNODES][MAX_NUMNODES];
+
 static int cmp_p64(const void *a, const void *b)
 {
const struct linux_prom64_registers *x = a, *y = b;
@@ -1157,6 +1159,48 @@ static struct mdesc_mlgroup * __init find_mlgroup(u64 node)
return NULL;
 }
 
+int __node_distance(int from, int to)
+{
+   if ((from >= MAX_NUMNODES) || (to >= MAX_NUMNODES)) {
+   pr_warn("Returning default NUMA distance value for %d->%d\n",
+   from, to);
+   return (from == to) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+   return numa_latency[from][to];
+}
+
+static int find_best_numa_node_for_mlgroup(struct mdesc_mlgroup *grp)
+{
+   int i;
+
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   struct node_mem_mask *n = &node_masks[i];
+
+   if ((grp->mask == n->mask) && (grp->match == n->val))
+   break;
+   }
+   return i;
+}
+
+static void find_numa_latencies_for_group(struct mdesc_handle *md, u64 grp,
+ int index)
+{
+   u64 arc;
+
+   mdesc_for_each_arc(arc, md, grp, MDESC_ARC_TYPE_FWD) {
+   int tnode;
+   u64 target = mdesc_arc_target(md, arc);
+   struct mdesc_mlgroup *m = find_mlgroup(target);
+
+   if (!m)
+   continue;
+   tnode = find_best_numa_node_for_mlgroup(m);
+   if (tnode == MAX_NUMNODES)
+   continue;
+   numa_latency[index][tnode] = m->latency;
+   }
+}
+
 static int __init numa_attach_mlgroup(struct mdesc_handle *md, u64 grp,
  int index)
 {
@@ -1220,9 +1264,16 @@ static int __init numa_parse_mdesc_group(struct mdesc_handle *md, u64 grp,
 static int __init numa_parse_mdesc(void)
 {
struct mdesc_handle *md = mdesc_grab();
-   int i, err, count;
+   int i, j, err, count;
u64 node;
 
+   /* Some sane defaults for numa latency values */
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   for (j = 0; j < MAX_NUMNODES; j++)
+   numa_latency[i][j] = (i == j) ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+
node = mdesc_node_by_name(md, MDESC_NODE_NULL, "latency-groups");
if (node == MDESC_NODE_NULL) {
mdesc_release(md);
@@ -1245,6 +1296,23 @@ static int __init numa_parse_mdesc(void)
count++;
}
 
+   count = 0;
+   mdesc_for_each_node_by_name(md, node, "group") {
+   find_numa_latencies_for_group(md, node, count);
+   count++;
+   }
+
+   /* Normalize numa latency matrix according to ACPI SLIT spec. */
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   u64 self_latency = numa_latency[i][i];
+
+   for (j = 0; j < MAX_NUMNODES; j++) {
+   num

Re: [PATCH 1/2] mm: mmap: Add new /proc tunable for mmap_base ASLR.

2015-10-28 Thread Jeffrey Vander Stoep
plain text this time...

> This all would be much cleaner if the arm architecture code were just to
> register the sysctl itself.
>
> As it sits this looks like a patchset that does not meaningfully bisect,
> and would result in code that is hard to trace and understand.

I believe the intent is to follow up with more architecture-specific
patches to allow each architecture to define the number of bits to use
(min, max, and default), since these values are architecture dependent.
An arm64 patch should be forthcoming, and others after that. With that
in mind, would you still prefer to have the sysctl code in the
arm-specific patch?


Re: [PATCH tip/core/rcu 11/13] rculist: Make list_entry_rcu() use lockless_dereference()

2015-10-28 Thread Paul E. McKenney
On Wed, Oct 28, 2015 at 09:35:42PM +0100, Patrick Marlier wrote:
> 
> 
> On 10/28/2015 09:33 AM, Ingo Molnar wrote:
> >
> >* Tejun Heo  wrote:
> >
> >>Subject: writeback: don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
> >>
> >>bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to
> >>walk @bdi->wb_list.  To set up the initial iteration condition, it
> >>uses list_entry_rcu() to calculate the entry pointer corresponding to
> >>the list head; however, this isn't an actual RCU dereference and using
> >>list_entry_rcu() for it ended up breaking a proposed list_entry_rcu()
> >>change because it was feeding an non-lvalue pointer into the macro.
> >>
> >>Don't use the RCU variant for simple pointer offsetting.  Use
> >>list_entry() instead.
> >>
> >>Signed-off-by: Tejun Heo 
> >>---
> >>  fs/fs-writeback.c |4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>index 29e4599..7378169 100644
> >>--- a/fs/fs-writeback.c
> >>+++ b/fs/fs-writeback.c
> >>@@ -779,8 +779,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
> >>  bool skip_if_busy)
> >>  {
> >>struct bdi_writeback *last_wb = NULL;
> >>-   struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
> >>-   struct bdi_writeback, bdi_node);
> >>+   struct bdi_writeback *wb = list_entry(&bdi->wb_list,
> >>+ struct bdi_writeback, bdi_node);
> >>
> >>might_sleep();
> >
> >Any objections against me applying this fix to tip:core/rcu so that I can 
> >push the
> >recent RCU changes towards linux-next without triggering a build failure?
> 
> No objection on my side but probably you are waiting for an ack from
> somebody else.

I am guessing that he was asking Tejun, but just for the record, I am
OK with it as well:

Acked-by: Paul E. McKenney 

Thanx, Paul



Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Marcelo Tosatti
On Wed, Oct 28, 2015 at 06:05:00PM +0100, Paolo Bonzini wrote:
> 
> 
> On 28/10/2015 17:00, Alex Williamson wrote:
> > > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > > not just for VT-d, to do the MSI injection directly from the VFIO
> > > interrupt handler and bypass the eventfd?  Basically this would add an
> > > RCU-protected list of consumers matching the token to struct
> > > irq_bypass_producer, and a
> > > 
> > >   int (*inject)(struct irq_bypass_consumer *);
> > > 
> > > callback to struct irq_bypass_consumer.  If any callback returns true,
> > > the eventfd is not signaled.
> >
> > Yeah, that might be a good idea, it's probably more plausible than
> > making the eventfd_signal() code friendly to call from hard interrupt
> > context.  On the vfio side can we use request_threaded_irq() directly
> > for this?
> 
> I don't know if that gives you a non-threaded IRQ with the real-time
> kernel...  CCing Marcelo to get some insight.

The vfio interrupt handler (threaded or not) runs at a higher priority
than the vcpu thread. So don't worry about -RT.

About bypass: the smaller number of instructions between device ISR and
injection of interrupt to guest, the better, as that will translate
directly to reduction in interrupt latency times, which is important, as
it determines 

1. how often you can switch from poll mode to ACPI C-states.
2. whether the realtime workload is virtualizable.

The answer to properties of request_threaded_irq() is: don't know.

> > Making the hard irq handler return IRQ_HANDLED if we can use
> > the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
> > I think we need some way to get back to irq thread context to use
> > eventfd_signal().
> 
> The irqfd is already able to schedule a work item, because it runs with
> interrupts disabled, so I think we can always return IRQ_HANDLED.
> 
> There's another little complication.  Right now, only x86 has
> kvm_set_msi_inatomic.  We should merge kvm_set_msi_inatomic,
> kvm_set_irq_inatomic and kvm_arch_set_irq.
> 
> Some cleanups are needed there; the flow between the functions is really
> badly structured because the API grew somewhat by accretion.  I'll get
> to it next week or on the way back to Italy.
> 
> > Would we ever not want to use the direct bypass
> > manager path if available?  Thanks,
> 
> I don't think so.  KVM always registers itself as a consumer, even if
> there is no VT-d posted interrupts.  add_producer simply returns -EINVAL
> then.
> 
> Paolo


Re: [PATCH] oom_kill: add option to disable dump_stack()

2015-10-28 Thread David Rientjes
On Tue, 27 Oct 2015, Aristeu Rozanski wrote:

> Hi Michal,
> On Tue, Oct 27, 2015 at 05:20:47PM +0100, Michal Hocko wrote:
> > Yes this is a mess. But I think it is worth cleaning up.
> > dump_stack_print_info (arch independent) has a log level parameter.
> > show_stack_log_lvl (x86) has a loglevel parameter which is unused.
> > 
> > I haven't checked other architectures but the transition doesn't have to
> > be all at once I guess.
> 
> Ok, will keep working on it then.
> 

No objection on changing the loglevel of the stack trace from the oom 
killer and the bonus is that we can avoid yet another tunable, yay!


Re: [PATCH] MAINTAINERS: Start using the 'reviewer' (R) tag

2015-10-28 Thread Krzysztof Kozlowski
On 28.10.2015 23:38, Lee Jones wrote:
> On Wed, 28 Oct 2015, Javier Martinez Canillas wrote:
>> They are not maintainers according to your definition of maintainer
>> that doesn't seem what most people agree with.
> 
> "most people" so far are 3 people that I assume still want to be
> Maintainers despite not actually conducting Maintainer duties, but
> are rather Reviewers.  I also have 2 Acks for this patch, so thus far
> that's 3 that agree and 3 that do not.  Unsurprisingly the ones that
> agree are Maintainers and the ones who are not are (by my definition)
> Reviewers -- go figure.
> 

I am not sure on which side you put me finally. :)
If there is a consensus among some more experienced developers that
maintainer means branch and patches maintaining, then I won't see any
problem with the patch nor with switching Samsung PMIC entries to review.

In that case, that would be:
Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof



Re: 答复: [PATCHv2 4.3-rc6] proc: fix convert from oom_score_adj to oom_adj

2015-10-28 Thread David Rientjes
On Wed, 28 Oct 2015, Hongjie Fang (方洪杰) wrote:

> From a userspace perspective, reading back a different value than was
> written must be confusing.
> 

It's confusing, but with purpose: it shows there is no direct mapping 
between /proc/pid/oom_adj and /proc/pid/oom_score_adj.  
/proc/pid/oom_score_adj is the effective policy and has been for years.  
The value returned by /proc/pid/oom_adj demonstrates reality vs what is 
perceived and is a side-effect of integer division truncating the result 
in C.

It's a bad situation, I agree, and we anticipated the complete removal of 
/proc/pid/oom_adj years ago since it has been deprecated for years.  Maybe 
one day we can convince Linus that is possible, but until then we're stuck 
with it.

Re: [PATCH] get_maintainer: Add subsystem to reviewer output

2015-10-28 Thread Krzysztof Kozlowski
On 29.10.2015 01:41, Joe Perches wrote:
> Reviewer output currently does not include the subsystem
> that matched.  Add it.
> 
> Miscellanea:
> 
> o Add a get_subsystem_name routine to centralize this
> 
> Signed-off-by: Joe Perches 
> ---
>  scripts/get_maintainer.pl | 31 ---
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 

Tested-by: Krzysztof Kozlowski 

Best regards,
Krzysztof



Re: [PATCH v13 2/5] gennvm: Generic NVM manager

2015-10-28 Thread Dongsheng Yang

On 10/28/2015 08:30 AM, Matias Bjørling wrote:

The implementation for Open-Channel SSDs is divided into media

[...]

+   lun->reserved_blocks = 2; /* for GC only */
+   lun->vlun.id = i;
+   lun->vlun.lun_id = i % dev->luns_per_chnl;
+   lun->vlun.chnl_id = i / dev->luns_per_chnl;


Please use do_div(). The 64-bit % operator is not supported on some
platforms, as the kbuild robot pointed out in v12.

Yang


+   lun->vlun.nr_free_blocks = dev->blks_per_lun;
+   }
+   return 0;
+}
+
+static int gennvm_block_bb(u32 lun_id, void *bb_bitmap, unsigned int nr_blocks,
+   void *private)
+{
+   struct gen_nvm *gn = private;
+   struct gen_lun *lun = &gn->luns[lun_id];
+   struct nvm_block *block;
+   int i;
+
+   if (unlikely(bitmap_empty(bb_bitmap, nr_blocks)))
+   return 0;
+
+   i = -1;
+   while ((i = find_next_bit(bb_bitmap, nr_blocks, i + 1)) <
+   nr_blocks) {
+   block = &lun->vlun.blocks[i];
+   if (!block) {
+   pr_err("gen_nvm: BB data is out of bounds.\n");
+   return -EINVAL;
+   }
+   list_move_tail(&block->list, &lun->bb_list);
+   }
+
+   return 0;
+}
+
+static int gennvm_block_map(u64 slba, u32 nlb, __le64 *entries, void *private)
+{
+   struct nvm_dev *dev = private;
+   struct gen_nvm *gn = dev->mp;
+   sector_t max_pages = dev->total_pages * (dev->sec_size >> 9);
+   u64 elba = slba + nlb;
+   struct gen_lun *lun;
+   struct nvm_block *blk;
+   u64 i;
+   int lun_id;
+
+   if (unlikely(elba > dev->total_pages)) {
+   pr_err("gen_nvm: L2P data from device is out of bounds!\n");
+   return -EINVAL;
+   }
+
+   for (i = 0; i < nlb; i++) {
+   u64 pba = le64_to_cpu(entries[i]);
+
+   if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+   pr_err("gen_nvm: L2P data entry is out of bounds!\n");
+   return -EINVAL;
+   }
+
+   /* Address zero is a special one. The first page on a disk is
+* protected. It often holds internal device boot
+* information.
+*/
+   if (!pba)
+   continue;
+
+   /* resolve block from physical address */
+   lun_id = div_u64(pba, dev->sec_per_lun);
+   lun = &gn->luns[lun_id];
+
+   /* Calculate block offset into lun */
+   pba = pba - (dev->sec_per_lun * lun_id);
+   blk = &lun->vlun.blocks[div_u64(pba, dev->sec_per_blk)];
+
+   if (!blk->type) {
+   /* at this point, we don't know anything about the
+* block. It's up to the FTL on top to re-establish the
+* block state
+*/
+   list_move_tail(&blk->list, &lun->used_list);
+   blk->type = 1;
+   lun->vlun.nr_free_blocks--;
+   }
+   }
+
+   return 0;
+}
+
+static int gennvm_blocks_init(struct nvm_dev *dev, struct gen_nvm *gn)
+{
+   struct gen_lun *lun;
+   struct nvm_block *block;
+   sector_t lun_iter, blk_iter, cur_block_id = 0;
+   int ret;
+
+   gennvm_for_each_lun(gn, lun, lun_iter) {
+   lun->vlun.blocks = vzalloc(sizeof(struct nvm_block) *
+   dev->blks_per_lun);
+   if (!lun->vlun.blocks)
+   return -ENOMEM;
+
+   for (blk_iter = 0; blk_iter < dev->blks_per_lun; blk_iter++) {
+   block = &lun->vlun.blocks[blk_iter];
+
+   INIT_LIST_HEAD(&block->list);
+
+   block->lun = &lun->vlun;
+   block->id = cur_block_id++;
+
+   /* First block is reserved for device */
+   if (unlikely(lun_iter == 0 && blk_iter == 0))
+   continue;
+
+   list_add_tail(&block->list, &lun->free_list);
+   }
+
+   if (dev->ops->get_bb_tbl) {
+   ret = dev->ops->get_bb_tbl(dev->q, lun->vlun.id,
+   dev->blks_per_lun, gennvm_block_bb, gn);
+   if (ret)
+   pr_err("gen_nvm: could not read BB table\n");
+   }
+   }
+
+   if (dev->ops->get_l2p_tbl) {
+   ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+   gennvm_block_map, dev);
+   if (ret) {
+   pr_err("gen_nvm: could not read L2P table.\n");
+   pr_warn("gen_nvm: default block initialization");
+ 

Re: [PATCH v3 1/2] ASoC: wm9713: convert to regmap

2015-10-28 Thread Mark Brown
On Wed, Oct 28, 2015 at 12:43:51PM +, Charles Keepax wrote:
> On Tue, Oct 27, 2015 at 10:58:21PM +0100, Robert Jarzmik wrote:

Please delete unneeded context from mails when replying.  Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.

> Why is the necessary? I can't see an obvious sign that these
> writes bypass the cache in the non-regmap version, am I missing
> something? Also if this is necessary I would quite like it to be
> accompanied by a comment in the code to explain why it is safe to
> do this here. Regarding the inherent dangers of cache bypass I
> explained in my last email.

It's probably worth pointing out that the functionality in the regmap
API is essentially the same as that in the old ASoC cache code; a
conversion should pretty much be a case of directly translating API
calls.




Re: [PATCH v12 5/6] ARM: socfpga: add bindings document for fpga bridge drivers

2015-10-28 Thread Rob Herring
On Tue, Oct 27, 2015 at 5:09 PM,   wrote:
> From: Alan Tull 
>
> Add bindings documentation for Altera SOCFPGA bridges:
>  * fpga2sdram
>  * fpga2hps
>  * hps2fpga
>  * lwhps2fpga
>
> Signed-off-by: Alan Tull 

Oops...

> Signed-off-by: Dinh Nguyen 
> Signed-off-by: Matthew Gerlach 

These should be roughly in order of who did modifications. I'd expect
you to be last.

> ---
> v2:  separate into 2 documents for the 2 drivers
> v12: bump version to line up with simple-fpga-bus version
>  remove Linux specific notes such as references to sysfs
>  move non-DT specific documentation elsewhere
>  remove bindings that would have been used to pass configuration
>  clean up formatting
> ---
>  .../bindings/fpga/altera-fpga2sdram-bridge.txt |   18 ++
>  .../bindings/fpga/altera-hps2fpga-bridge.txt   |   36
>  2 files changed, 54 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/fpga/altera-fpga2sdram-bridge.txt
>  create mode 100644 Documentation/devicetree/bindings/fpga/altera-hps2fpga-bridge.txt
>
> diff --git a/Documentation/devicetree/bindings/fpga/altera-fpga2sdram-bridge.txt b/Documentation/devicetree/bindings/fpga/altera-fpga2sdram-bridge.txt
> new file mode 100644
> index 000..11eb5b7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/fpga/altera-fpga2sdram-bridge.txt
> @@ -0,0 +1,18 @@
> +Altera FPGA To SDRAM Bridge Driver
> +
> +Required properties:
> +- compatible   : Should contain "altr,socfpga-fpga2sdram-bridge"
> +
> +Optional properties:
> +- label: User-readable name for this bridge.
> + Default is br

Why does the user need label? We generally use label to match physical
labels like "Rear USB port" or "disk LED" or something.

> +- init-val : 0 if driver should disable bridge at startup
> + 1 if driver should enable bridge at startup

Perhaps "bridge-enable" would be a more descriptive name.

And to comment on other replies, I have no problem with this type of
property in the DT. But yes, configuration type properties will get
more scrutiny.

> + Default is to leave bridge in current state.
> +
> +Example:
> +   fpga2sdram_br: fpgabridge@3 {
> +   compatible = "altr,socfpga-fpga2sdram-bridge";
> +   label = "fpga2sdram";
> +   init-val = <0>;
> +   };
> diff --git 
> a/Documentation/devicetree/bindings/fpga/altera-hps2fpga-bridge.txt 
> b/Documentation/devicetree/bindings/fpga/altera-hps2fpga-bridge.txt
> new file mode 100644
> index 000..eb52f3b
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/fpga/altera-hps2fpga-bridge.txt
> @@ -0,0 +1,36 @@
> +Altera FPGA/HPS Bridge Driver
> +
> +Required properties:
> +- compatible   : Should contain one of:
> + "altr,socfpga-hps2fpga-bridge",
> + "altr,socfpga-lwhps2fpga-bridge", or
> + "altr,socfpga-fpga2hps-bridge"
> +- clocks   : Clocks used by this module.
> +
> +Optional properties:
> +- label: User-readable name for this bridge.
> + Default is br
> +- init-val : 0 if driver should disable bridge at startup.
> + 1 if driver should enable bridge at startup.
> + Default is to leave bridge in its current state.
> +
> +Example:
> +   hps_fpgabridge0: fpgabridge@0 {
> +   compatible = "altr,socfpga-hps2fpga-bridge";
> +   label = "hps2fpga";
> +   clocks = <&l4_main_clk>;
> +   init-val = <1>;
> +   };
> +
> +   hps_fpgabridge1: fpgabridge@1 {
> +   compatible = "altr,socfpga-lwhps2fpga-bridge";
> +   label = "lwhps2fpga";
> +   clocks = <&l4_main_clk>;
> +   init-val = <0>;
> +   };
> +
> +   hps_fpgabridge2: fpgabridge@2 {
> +   compatible = "altr,socfpga-fpga2hps-bridge";
> +   label = "fpga2hps";
> +   clocks = <&l4_main_clk>;
> +   };
> --
> 1.7.9.5
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [alsa-devel] [PATCH V2 02/10] ASoC: img: Add driver for I2S input controller

2015-10-28 Thread Mark Brown
On Wed, Oct 28, 2015 at 09:18:20PM +, Damien Horsley wrote:
> On 28/10/15 01:04, Mark Brown wrote:

> >> I think it also makes sense to keep the blocks consistent with each
> >> other. The spdif (out and in), and parallel out, all flush automatically
> >> when stopped, and the fifo for the i2s out block is cleared when the
> >> reset is asserted.

> > This seems like an issue that got missed in the other drivers then.  I'd
> > expect the trigger operation to be a minimal operation which starts and
> > stops the data transfer, not doing anything else.

> The spdif out, spdif in, and parallel out blocks auto-flush whenever
> they are stopped. It is not possible for software to prevent this behavior.

Oh, so this isn't the drivers doing this?  In that case it's fine for
them to do that; if it's what the hardware does, it's what the hardware
does.  It sounded like you were saying that there was similar code in
the other drivers.


signature.asc
Description: PGP signature


Re: [PATCH 1/2] mm: mmap: Add new /proc tunable for mmap_base ASLR.

2015-10-28 Thread Eric W. Biederman
Daniel Cashman  writes:

> From: dcashman 
>
> ASLR currently only uses 8 bits to generate the random offset for the
> mmap base address on 32 bit architectures. This value was chosen to
> prevent a poorly chosen value from dividing the address space in such
> a way as to prevent large allocations. This may not be an issue on all
> platforms. Allow the specification of a minimum number of bits so that
> platforms desiring greater ASLR protection may determine where to place
> the trade-off.

This all would be much cleaner if the arm architecture code were just to
register the sysctl itself.

As it sits, this looks like a patchset that does not meaningfully bisect,
and would result in code that is hard to trace and understand.

Eric

> Signed-off-by: Daniel Cashman 
> ---
>  Documentation/sysctl/kernel.txt | 14 ++
>  include/linux/mm.h  |  6 ++
>  kernel/sysctl.c | 11 +++
>  3 files changed, 31 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 6fccb69..0d4ca53 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -41,6 +41,7 @@ show up in /proc/sys/kernel:
>  - kptr_restrict
>  - kstack_depth_to_print   [ X86 only ]
>  - l2cr[ PPC only ]
> +- mmap_rnd_bits
>  - modprobe==> Documentation/debugging-modules.txt
>  - modules_disabled
>  - msg_next_id  [ sysv ipc ]
> @@ -391,6 +392,19 @@ This flag controls the L2 cache of G3 processor boards. 
> If
>  
>  ==
>  
> +mmap_rnd_bits:
> +
> +This value can be used to select the number of bits to use to
> +determine the random offset to the base address of vma regions
> +resulting from mmap allocations on architectures which support
> +tuning address space randomization.  This value will be bounded
> +by the architecture's minimum and maximum supported values.
> +
> +This value can be changed after boot using the
> +/proc/sys/kernel/mmap_rnd_bits tunable
> +
> +==
> +
>  modules_disabled:
>  
>  A toggle value indicating if modules are allowed to be loaded
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80001de..15b083a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -51,6 +51,12 @@ extern int sysctl_legacy_va_layout;
>  #define sysctl_legacy_va_layout 0
>  #endif
>  
> +#ifdef CONFIG_ARCH_MMAP_RND_BITS
> +extern int mmap_rnd_bits_min;
> +extern int mmap_rnd_bits_max;
> +extern int mmap_rnd_bits;
> +#endif
> +
>  #include 
>  #include 
>  #include 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e69201d..37e657a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1139,6 +1139,17 @@ static struct ctl_table kern_table[] = {
>   .proc_handler   = timer_migration_handler,
>   },
>  #endif
> +#ifdef CONFIG_ARCH_MMAP_RND_BITS
> + {
> + .procname   = "mmap_rnd_bits",
> + .data   = &mmap_rnd_bits,
> + .maxlen = sizeof(mmap_rnd_bits),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = &mmap_rnd_bits_min,
> + .extra2 = &mmap_rnd_bits_max,
> + },
> +#endif
>   { }
>  };


Re: [PATCH v12 2/6] fpga: add bindings document for simple fpga bus

2015-10-28 Thread Rob Herring
On Tue, Oct 27, 2015 at 5:09 PM,   wrote:
> From: Alan Tull 
>
> New bindings document for simple fpga bus.
>
> Signed-off-by: Alan Tull 
> ---
> v9:  initial version added to this patchset
> v10: s/fpga/FPGA/g
>  replace DT overlay example with slightly more complicated example
>  move to staging/simple-fpga-bus
> v11: No change in this patch for v11 of the patch set
> v12: Moved out of staging.
>  Changed to use FPGA bridges framework instead of resets
>  for bridges.
> ---
>  .../devicetree/bindings/fpga/simple-fpga-bus.txt   |   81 
> 
>  1 file changed, 81 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/fpga/simple-fpga-bus.txt
>
> diff --git a/Documentation/devicetree/bindings/fpga/simple-fpga-bus.txt 
> b/Documentation/devicetree/bindings/fpga/simple-fpga-bus.txt
> new file mode 100644
> index 000..2e742f7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/fpga/simple-fpga-bus.txt
> @@ -0,0 +1,81 @@
> +Simple FPGA Bus
> +===
> +
> +A Simple FPGA Bus is a bus that handles configuring an FPGA and its bridges
> +before populating the devices below its node.  All this happens when a device
> +tree overlay is added to the live tree.  This document describes that device
> +tree overlay.
> +
> +Required properties:
> +- compatible : should contain "simple-fpga-bus"
> +- #address-cells, #size-cells, ranges: must be present to handle address 
> space
> +  mapping for children.
> +
> +Optional properties:
> +- fpga-mgr : should contain a phandle to a FPGA manager.
> +- fpga-firmware : should contain the name of a FPGA image file located on the
> +  firmware search path.

Putting a firmware filename in the DT has come up in other cases
recently[1] and we concluded it should not be in the DT.  Maybe the
conclusion would be different here, and if so we should have a common
property.

> +- partial-reconfig : boolean property should be defined if partial
> +  reconfiguration of the FPGA is to be done, otherwise full reconfiguration
> +  is done.
> +- fpga-bridges : should contain a list of bridges that the bus will disable
> +  before programming the FPGA and then enable after the FPGA has been
> +  programmed.
> +
> +Example:
> +
> +/dts-v1/;
> +/plugin/;
> +/ {
> +   fragment@0 {
> +   target-path="/soc";
> +   __overlay__ {
> +   #address-cells = <1>;
> +   #size-cells = <1>;
> +
> +   bridge@0xff20 {
> +   compatible = "simple-fpga-bus";
> +   reg = <0xc000 0x2000>,
> + <0xff20 0x0020>;

You have registers for the bus, so therefore it is not simple. I think
the bus or bridge needs a specific compatible string.

Rob

[1] http://www.spinics.net/lists/devicetree/msg92462.html


Re: [PATCH] __div64_32: implement division by multiplication for 32-bit arches

2015-10-28 Thread Nicolas Pitre
On Thu, 29 Oct 2015, Alexey Brodkin wrote:

> Fortunately we already have a much better __div64_32() for 32-bit ARM.
> There, in the case of division by a constant, the preprocessor calculates
> a so-called "magic number" which is later used in multiplications instead
> of divisions.

It's not magic, it is science.  :-)

> It's really nice and very optimal but obviously works only for ARM
> because ARM assembly is involved.
> 
> Now why don't we extend the same approach to all other 32-bit arches,
> with the multiplication part implemented in pure C? With a good compiler
> the resulting assembly will be quite close to manually written assembly.

You appear to have left out optimizations where there is no overflow to 
carry.  That, too, can be determined at compile time.

> But there's at least 1 problem which I don't know how to solve.
> Preprocessor magic only happens if __div64_32() is inlined (that's
> obvious - preprocessor has to know if divider is constant or not).
> 
> But __div64_32() is already marked as weak function (which in its turn
> is required to allow some architectures to provide its own optimal
> implementations). I.e. addition of "inline" for __div64_32() is not an
> option.

You can't inline __div64_32().  It should remain as is and used only for 
the slow path.

For the constant based optimization to work, you need to modify do_div() 
in include/asm-generic/div64.h directly.

> So I do want to hear opinions on how to proceed with that patch.
> Indeed there's the simplest solution - use this implementation only in
> my architecture of preference (read ARC) but IMHO this change may
> benefit other architectures as well.
> 
> Signed-off-by: Alexey Brodkin 
> Cc: linux-snps-...@lists.infradead.org
> Cc: Vineet Gupta 
> Cc: Ingo Molnar 
> Cc: Stephen Hemminger 
> Cc: David S. Miller 
> Cc: Nicolas Pitre 

This email address has been unused for the last 7 years. Please update 
your reference.

> Cc: Russell King 
> ---
>  lib/div64.c | 153 
> ++--
>  1 file changed, 128 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/div64.c b/lib/div64.c
> index 62a698a..3055328 100644
> --- a/lib/div64.c
> +++ b/lib/div64.c
> @@ -23,37 +23,140 @@
>  /* Not needed on 64bit architectures */
>  #if BITS_PER_LONG == 32
>  
> +/* our own fls implementation to make sure constant propagation is fine */
> +inline int __div64_fls(int bits)
> +{
> + unsigned int __left = bits, __nr = 0;
> +
> + if (__left & 0x)
> + __nr += 16, __left >>= 16;
> +
> + if (__left & 0xff00)
> + __nr +=  8, __left >>=  8;
> +
> + if (__left & 0x00f0)
> + __nr +=  4, __left >>=  4;
> +
> + if (__left & 0x000c)
> + __nr +=  2, __left >>=  2;
> +
> + if (__left & 0x0002)
> + __nr +=  1;
> +
> + return __nr;
> +}

The regular fls implementation should already give you a constant result 
if provided with a constant input.  To be sure you could use:

__p = 1 << __fls(__b);
BUILD_BUG_ON(!__builtin_constant_p(__p));

> +/*
> + * If the divisor happens to be constant, we determine the appropriate
> + * inverse at compile time to turn the division into a few inline
> + * multiplications instead which is much faster.
> + */
>  uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
>  {
> - uint64_t rem = *n;
> - uint64_t b = base;
> - uint64_t res, d = 1;
> - uint32_t high = rem >> 32;
> -
> - /* Reduce the thing a bit first */
> - res = 0;
> - if (high >= base) {
> - high /= base;
> - res = (uint64_t) high << 32;
> - rem -= (uint64_t) (high*base) << 32;
> - }
> + unsigned int __r, __b = base;
>  
> - while ((int64_t)b > 0 && b < rem) {
> - b = b+b;
> - d = d+d;
> - }
> + if (!__builtin_constant_p(__b) || __b == 0) {
> + /* non-constant divisor (or zero): slow path */
> + uint64_t rem = *n;
> + uint64_t b = base;
> + uint64_t res, d = 1;
> + uint32_t high = rem >> 32;
> +
> + /* Reduce the thing a bit first */
> + res = 0;
> + if (high >= base) {
> + high /= base;
> + res = (uint64_t) high << 32;
> + rem -= (uint64_t) (high*base) << 32;
> + }
> +
> + while ((int64_t)b > 0 && b < rem) {
> + b = b+b;
> + d = d+d;
> + }
> +
> + do {
> + if (rem >= b) {
> + rem -= b;
> + res += d;
> + }
> + b >>= 1;
> + d >>= 1;
> + } while (d);
>  
> - do {
> - if (rem >= b) {
> - rem -= b;
> - res += d;
> + *n = res;
> + __r 

Re: Triggering non-integrity writeback from userspace

2015-10-28 Thread Dave Chinner
On Thu, Oct 29, 2015 at 07:48:34AM +1100, Dave Chinner wrote:
> Hi Andres,
> 
> On Wed, Oct 28, 2015 at 10:27:52AM +0100, Andres Freund wrote:
> > On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:
> 
> > > Data integrity operations require related file metadata (e.g. block
> > > allocation trnascations) to be forced to the journal/disk, and a
> > > device cache flush issued to ensure the data is on stable storage.
> > > SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> > > the IO might be the same pattern as a data integrity operation, it
> > > does not provide such guarantees.
> > 
> > Which is desired here - the actual integrity is still going to be done
> > via fsync().
> 
> OK, so you require data integrity, but
> 
> > The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> > the fsync() will only have to do very little work. The language in
> > sync_file_range(2) doesn't inspire enough confidence for using it as an
> > actual integrity operation :/
> 
> So really you're trying to minimise the blocking/latency of fsync()?
> 
> > > You don't want to do writeback from the syscall, right? i.e. you'd
> > > like to expire the inode behind the fd, and schedule background
> > > writeback to run on it immediately?
> > 
> > Yes, that's exactly what we want. Blocking if a process has done too
> > much writes is fine tho.
> 
> OK, so it's really the latency of the fsync() operation that is what
> you are trying to avoid? I've been meaning to get back to a generic
> implementation of an aio fsync operation:
> 
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> 
> Would that be a better approach to solving your need for a
> non-blocking data integrity flush of a file?

Which was relatively trivial to do. Numbers below come from XFS; I
smoke tested ext4 and it kinda worked, but behaviour was very
unpredictable and maxed out at about 25000 IOPS, with max
performance being at 4 threads @ an average of 2 files/s...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

[RFC] aio: wire up generic aio_fsync method

From: Dave Chinner 

We've had plenty of requests for an asynchronous fsync over the past
few years, and we've got the infrastructure there to do it. But
nobody has wired it up to test it. The common request we get from
userspace storage applications is to do a post-write pass over a set
of files that were just written (i.e. bulk background fsync) for
point-in-time checkpointing or flushing purposes.

So, just to see if I could brute force an effective implementation,
wire up aio_fsync, add a workqueue and push all the fsync calls off
to the workqueue. The workqueue will allow parallel dispatch, switch
execution if an fsync blocks for any reason, etc. Brute force and
very effective.

So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
interface to run some tests. The quick test is:

- write 1 4k files into the cache
- run a post write open-fsync-close pass (sync mode 5)
- run 5 iterations
- run a single thread, then 4 threads.

First I ran it on a 500TB sparse filesystem on a SSD.

FSUse%Count SizeFiles/sec App Overhead
 01 4096507.5   184435
 02 4096527.2   184815
 03 4096530.4   183798
 04 4096531.0   189431
 05 4096554.2   181557

real1m34.548s
user0m0.819s
sys 0m10.596s

Runs at around 500 log forces/s, resulting in 500 log writes/s,
giving a sustained IO load of about 1200 IOPS.

Using io_fsync():

FSUse%Count SizeFiles/sec App Overhead
 01 4096   4124.1   151359
 02 4096   5506.4   112704
 03 4096   7347.197967
 04 4096   7110.197089
 05 4096   7075.394942

real0m8.554s
user0m0.350s
sys 0m3.684s

Runs at around 7,000 log forces/s, which are mostly aggregated down
to around 700 log writes/s, for a total sustained load of ~8000 IOPS.
The parallel dispatch of fsync operations allows the log to
aggregate them effectively, reducing journal IO by a factor of 10.
Run the same workload, 4 threads at a time. Normal fsync:

FSUse%Count SizeFiles/sec App Overhead
 04 4096   2156.0   690185
 08 4096   1859.6   693849
 0   12 4096   1858.8   723889
 0   16 4096   1848.5   708657
 0   20 4096   1842.7   736587

Runs at ~2000 log forces/s, resulting in ~1000 log writes/s and
3,000 IOPS. We see the journal writes being aggregated, but nowhere
near the rate of the previous async fsync ru

Re: Triggering non-integrity writeback from userspace

2015-10-28 Thread Andres Freund
Hi,

On 2015-10-29 07:48:34 +1100, Dave Chinner wrote:
> > The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> > the fsync() will only have to do very little work. The language in
> > sync_file_range(2) doesn't inspire enough confidence for using it as an
> > actual integrity operation :/
> 
> So really you're trying to minimise the blocking/latency of fsync()?

The blocking/latency of the fsync doesn't actually matter at all *for
this callsite*. It's called from a dedicated background process - if
it's slowed down by a couple seconds it doesn't matter much.
The problem is that if you have a couple gigabytes of dirty data being
fsync()ed at once, latency for concurrent reads and writes often goes
absolutely apeshit. And those concurrent reads and writes might
actually be latency sensitive.

By calling sync_file_range() over small ranges of pages shortly after
they've been written we make it unlikely (but still possible) that much
data has to be flushed at fsync() time.


In case it's interesting: the relevant background process is the
"checkpointer" - it writes all dirty data from postgres' in-memory
shared buffer cache back to disk, then fsyncs all files that have been
touched since the last checkpoint (they might already have been flushed
independently). After that it can remove the old write-ahead-log/journal.


> > > You don't want to do writeback from the syscall, right? i.e. you'd
> > > like to expire the inode behind the fd, and schedule background
> > > writeback to run on it immediately?
> > 
> > Yes, that's exactly what we want. Blocking if a process has done too
> > much writes is fine tho.
> 
> OK, so it's really the latency of the fsync() operation that is what
> you are trying to avoid? I've been meaning to get back to a generic
> implementation of an aio fsync operation:
> 
> http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
> 
> Would that be a better approach to solving your need for a
> non-blocking data integrity flush of a file?

So an async fsync() isn't that particularly interesting for the
checkpointer/the issue in this thread. But there's another process in
postgres where I could imagine it being useful. We have a "background"
process that regularly flushes the journal to disk. It currently uses
fdatasync() to do so for subsections of a preallocated/reused file. It
tries to sync the sections that will soon need to be flushed to disk
because a transaction is about to commit.

I could imagine that it's good for throughput to issue multiple
asynchronous fsyncs in this background process. Might not be good for
latency sensitive workloads tho.

At the moment using fdatasync() instead of fsync() is a considerable
performance advantage... If I understand the above proposal correctly,
it'd allow specifying ranges, is that right?


There'll be some concern about portability around this - issuing
sync_file_range() every now and then isn't particularly invasive. Using
aio might end up being that, not sure.

Greetings,

Andres Freund


Re: [PATCH] Add SPI to platform_no_drv_owner.cocci warnings

2015-10-28 Thread Mark Brown
On Wed, Oct 28, 2015 at 12:49:07PM -0500, Andrew F. Davis wrote:
> Remove .owner field if calls are used which set it automatically
> 
> Signed-off-by: Andrew F. Davis 
> ---
>  scripts/coccinelle/api/platform_no_drv_owner.cocci | 73 
> ++

You need to send this to whoever the maintainers for the coccinelle
script are.




RE: [PATCH 00/16] staging: comedi: comedi_test: enhancements

2015-10-28 Thread Hartley Sweeten
On Tuesday, October 27, 2015 9:59 AM, Ian Abbott wrote:
> The "comedi_test" module is a driver for a dummy COMEDI device.  It has
> an analog input subdevice and an analog output subdevice.  The analog
> input subdevice supports COMEDI asynchronous acquisition commands using
> waveform generators to generate the input data for each channel.  A
> kernel timer is used to drive the acquisition.
>
> This series of patches cleans up the driver, enhances the existing
> asynchronous command support on the analog input subdevice, and adds
> asynchronous command support on the analog output subdevice.
>
> 01) staging: comedi: comedi_test: reformat multi-line comments
> 02) staging: comedi: comedi_test: saturate fake waveform values
> 03) staging: comedi: comedi_test: remove nano_per_micro
> 04) staging: comedi: comedi_test: limit maximum convert_arg
> 05) staging: comedi: comedi_test: support scan_begin_src == TRIG_FOLLOW
> 06) staging: comedi: comedi_test: move modulo operations for waveform
> 07) staging: comedi: comedi_test: use unsigned int for waveform timing
> 08) staging: comedi: comedi_test: simplify time since last AI scan
> 09) staging: comedi: comedi_test: rename members for AI commands
> 10) staging: comedi: comedi_test: rename waveform members
> 11) staging: comedi: comedi_test: make timer rate similar to scan rate
> 12) staging: comedi: comedi_test: use unsigned short for loopback values
> 13) staging: comedi: comedi_test: allow read-back of AO channels
> 14) staging: comedi: comedi_test: handle partial scans in timer routine
> 15) staging: comedi: comedi_test: rename waveform_ai_interrupt()
> 16) staging: comedi: comedi_test: implement commands on AO subdevice
>
>  drivers/staging/comedi/drivers/comedi_test.c | 565 
> ---
>  1 file changed, 416 insertions(+), 149 deletions(-)

Reviewed-by: H Hartley Sweeten 



Re: [PATCH 2/2] e1000e: Fix msi-x interrupt automask

2015-10-28 Thread Alexander Duyck

On 10/22/2015 05:32 PM, Benjamin Poirier wrote:

Since the introduction of 82574 support in e1000e, the driver has worked on
the assumption that msi-x interrupt generation is automatically disabled
after each irq. As it turns out, this is not the case. Currently, rx
interrupts can fire multiple times before and during napi processing. This
can be a problem for users because frames that arrive in a certain window
(after adapter->clean_rx() but before napi_complete_done() has cleared
NAPI_STATE_SCHED) generate an interrupt which does not lead to
napi_schedule(). These frames sit in the rx queue until another frame
arrives (a tcp retransmit for example).

While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.

This patch removes that erroneous write and fixes interrupt rearming for tx
and "other" interrupts. Since e1000_msix_other() reads ICR, all interrupts
must be rearmed in that function.

Reported-by: Frank Steiner 
Signed-off-by: Benjamin Poirier 
---
  drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a228167..8881256 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1921,7 +1921,8 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)

  no_link_interrupt:
if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER |
+E1000_IMS_LSC);

return IRQ_HANDLED;
  }


I would argue your first patch probably didn't go far enough to remove 
dead code.  Specifically you should only ever get into this function if 
LSC is set.  There are no other causes that should trigger this.  As 
such you could probably remove the ICR read, and instead replace it with 
an ICR write of the LSC bit since OTHER is already cleared via EIAC.



@@ -1940,6 +1941,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused 
irq, void *data)
/* Ring was not completely cleaned, so fire another interrupt */
ew32(ICS, tx_ring->ims_val);

+   if (!test_bit(__E1000_DOWN, &adapter->state))
+   ew32(IMS, E1000_IMS_TXQ0);
+
return IRQ_HANDLED;
  }



I think what you need to set here is tx_ring->ims_val, not E1000_IMS_TXQ0.


@@ -2027,11 +2031,7 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)

/* enable MSI-X PBA support */
ctrl_ext = er32(CTRL_EXT);
-   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR;
-
-   /* Auto-Mask Other interrupts upon ICR read */
-   ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER);
-   ctrl_ext |= E1000_CTRL_EXT_EIAME;
+   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME;
ew32(CTRL_EXT, ctrl_ext);
e1e_flush();
  }





[PATCH 2/3] cpuidle,menu: use interactivity_req to disable polling

2015-10-28 Thread riel
From: Rik van Riel 

The menu governor carefully figures out how much time we typically
sleep for an estimated sleep interval, or whether there is a repeating
pattern going on, and corrects that estimate for the CPU load.

Then it proceeds to ignore that information when determining whether
or not to consider polling. This is not a big deal on most x86 CPUs,
which have very low C1 latencies, and the patch should not have any
effect on those CPUs.

However, certain CPUs (eg. Atom) have much higher C1 latencies, and
it would be good to not waste performance and power on those CPUs if
we are expecting a very low wakeup latency.

Disable polling based on the estimated interactivity requirement, not
on the time to the next timer interrupt.

Signed-off-by: Rik van Riel 
---
 drivers/cpuidle/governors/menu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index ecc242a586c9..b1a55731f921 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -330,7 +330,7 @@ static int menu_select(struct cpuidle_driver *drv, struct 
cpuidle_device *dev)
 * We want to default to C1 (hlt), not to busy polling
 * unless the timer is happening really really soon.
 */
-   if (data->next_timer_us > 20 &&
+   if (interactivity_req > 20 &&
!drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
-- 
2.1.0



[PATCH 3/3] cpuidle,menu: smooth out measured_us calculation

2015-10-28 Thread riel
From: Rik van Riel 

The cpuidle state tables contain the maximum exit latency for each
cpuidle state. On x86, that is the exit latency for when the entire
package goes into that same idle state.

However, a lot of the time we only go into the core idle state,
not the package idle state. This means we see a much smaller exit
latency.

We have no way to detect whether we went into the core or package
idle state while idle, and that is ok.

However, the current menu_update logic does have the potential to
trip up the repeating pattern detection in get_typical_interval.
If the system is experiencing an exit latency near the idle state's
exit latency, some of the samples will have exit_us subtracted,
while others will not. This turns a repeating pattern into mush,
potentially breaking get_typical_interval.

Furthermore, for smaller sleep intervals, we know the chance that
all the cores in the package went to the same idle state are fairly
small. Dividing the measured_us by two, instead of subtracting the
full exit latency when hitting a small measured_us, will reduce the
error.

Signed-off-by: Rik van Riel 
---
 drivers/cpuidle/governors/menu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index b1a55731f921..7b0971d97cc3 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -404,8 +404,10 @@ static void menu_update(struct cpuidle_driver *drv, struct 
cpuidle_device *dev)
measured_us = cpuidle_get_last_residency(dev);
 
/* Deduct exit latency */
-   if (measured_us > target->exit_latency)
+   if (measured_us > 2 * target->exit_latency)
measured_us -= target->exit_latency;
+   else
+   measured_us /= 2;
 
/* Make sure our coefficients do not exceed unity */
if (measured_us > data->next_timer_us)
-- 
2.1.0



[PATCH 0/3] cpuidle: small improvements & fixes for menu governor

2015-10-28 Thread riel
While working on a paravirt cpuidle driver for KVM guests, I
noticed a number of small logic errors in the menu governor
code.

These patches should get rid of some artifacts that can break
the logic in the menu governor under certain corner cases, and
make idle state selection work better on CPUs with long C1 exit
latencies.

I have not seen any adverse effects with them in my (quick)
tests. As expected, they do not seem to do much on systems with
many power states and very low C1 exit latencies and target residencies.



[PATCH 1/3] cpuidle,x86: increase forced cut-off for polling to 20us

2015-10-28 Thread riel
From: Rik van Riel 

The cpuidle menu governor has a forced cut-off for polling at 5us,
in order to deal with firmware that gives the OS bad information
on cpuidle states, leading to the system spending way too much time
in polling.

However, at least one x86 CPU family (Atom) has chips that have
a 20us break-even point for C1. Forcing the polling cut-off to
less than that wastes performance and power.

Increase the polling cut-off to 20us.

Systems with a lower C1 latency will be found in the states table by
the menu governor, which will pick those states as appropriate.

Signed-off-by: Rik van Riel 
---
 drivers/cpuidle/governors/menu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 22e4463d1787..ecc242a586c9 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -330,7 +330,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 * We want to default to C1 (hlt), not to busy polling
 * unless the timer is happening really really soon.
 */
-   if (data->next_timer_us > 5 &&
+   if (data->next_timer_us > 20 &&
!drv->states[CPUIDLE_DRIVER_STATE_START].disabled &&
dev->states_usage[CPUIDLE_DRIVER_STATE_START].disable == 0)
data->last_state_idx = CPUIDLE_DRIVER_STATE_START;
-- 
2.1.0



Re: [PATCH 2/2] pmem: Add simple and slow fsync/msync support

2015-10-28 Thread Dan Williams
On Thu, Oct 29, 2015 at 7:09 AM, Ross Zwisler
 wrote:
> Make blkdev_issue_flush() behave correctly according to its required
> semantics - all volatile cached data is flushed to stable storage.
>
> Eventually this needs to be replaced with something much more precise by
> tracking dirty DAX entries via the radix tree in struct address_space, but
> for now this gives us correctness even if the performance is quite bad.
>
> Userspace applications looking to avoid the fsync/msync penalty should
> consider more fine-grained flushing via the NVML library:
>
> https://github.com/pmem/nvml
>
> Signed-off-by: Ross Zwisler 
> ---
>  drivers/nvdimm/pmem.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 0ba6a97..eea7997 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -80,7 +80,14 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
> if (do_acct)
> nd_iostat_end(bio, start);
>
> -   if (bio_data_dir(bio))
> +   if (bio->bi_rw & REQ_FLUSH) {
> +   void __pmem *addr = pmem->virt_addr + pmem->data_offset;
> +   size_t size = pmem->size - pmem->data_offset;
> +
> +   wb_cache_pmem(addr, size);
> +   }
> +

So I think this will be too expensive to run synchronously in the
submission path for very large pmem ranges and should be farmed out to
an async thread. Then, as long as we're farming it out, might as well
farm it out to more than one cpu.  I'll take a stab at this on the
flight back from KS.

Another optimization is that we can make the flush a nop up until
pmem_direct_access() is first called, because we know there is nothing
to flush when all the i/o is coming through the driver.  That at least
helps the "pmem as a fast SSD" use case avoid the overhead.

Bikeshed alert... wb_cache_pmem() should probably become
mmio_wb_cache() and live next to mmio_flush_cache() since it is not
specific to persistent memory.
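The "nop until direct_access() is first called" idea can be sketched as a tiny state machine. pmem_flush_state and both helpers are hypothetical names for illustration, not driver API:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the proposed optimization: skip the REQ_FLUSH full-range
 * writeback until direct_access() has handed out a DAX mapping, since
 * until then every write went through the driver and was flushed
 * per-bio already. */
struct pmem_flush_state {
	bool dax_mapped;		/* set on first direct_access() */
	unsigned long flushes_done;	/* full-range writebacks performed */
};

static void pmem_direct_access_called(struct pmem_flush_state *s)
{
	s->dax_mapped = true;
}

static void pmem_handle_flush(struct pmem_flush_state *s)
{
	if (!s->dax_mapped)
		return;			/* nothing dirty outside the driver */
	s->flushes_done++;		/* would call wb_cache_pmem() here */
}
```

This keeps the "pmem as a fast SSD" case free of the flush overhead while remaining correct once userspace mappings exist.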


Re: [PATCH 08/10] staging: lustre: remove white space in libcfs_hash.h

2015-10-28 Thread Greg Kroah-Hartman
On Wed, Oct 28, 2015 at 12:54:29PM -0400, James Simmons wrote:
> From: James Simmons 
> 
> Cleanup all the unneeded white space in libcfs_hash.h.
> 
> Signed-off-by: James Simmons 
> ---
>  .../lustre/include/linux/libcfs/libcfs_hash.h  |  147 ++--
>  1 files changed, 73 insertions(+), 74 deletions(-)
> 
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
> index 70b8b29..5df8ba2 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
> @@ -41,6 +41,9 @@
>  
>  #ifndef __LIBCFS_HASH_H__
>  #define __LIBCFS_HASH_H__
> +
> +#include 
> +
>  /*
>   * Knuth recommends primes in approximately golden ratio to the maximum
>   * integer representable by a machine word for multiplicative hashing.
> @@ -56,22 +59,13 @@
>  /*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
>  #define CFS_GOLDEN_RATIO_PRIME_64 0x9e37fffc0001ULL
>  
> -/*
> - * Ideally we would use HAVE_HASH_LONG for this, but on linux we configure
> - * the linux kernel and user space at the same time, so we need to differentiate
> - * between them explicitly. If this is not needed on other architectures, then
> - * we'll need to move the functions to architecture specific headers.
> - */
> -
> -#include 
> -

That's not "cleaning up whitespace", that's "deleting unused/unneeded
stuff".

Please be more careful and only do one thing per patch, you know better
than to try to sneak other changes in.

I'll stop here in applying this series, please fix up and resend.

greg k-h


Re: [PATCH 10/10] staging: lustre: remove white space in hash.c

2015-10-28 Thread Greg Kroah-Hartman
On Wed, Oct 28, 2015 at 12:54:31PM -0400, James Simmons wrote:
> From: James Simmons 
> 
> Cleanup all the unneeded white space in hash.c.
> 
> Signed-off-by: James Simmons 
> ---
>  drivers/staging/lustre/lustre/libcfs/hash.c |  336 ++-
>  1 files changed, 174 insertions(+), 162 deletions(-)
> 
> diff --git a/drivers/staging/lustre/lustre/libcfs/hash.c b/drivers/staging/lustre/lustre/libcfs/hash.c
> index 0308744..c5921f7 100644
> --- a/drivers/staging/lustre/lustre/libcfs/hash.c
> +++ b/drivers/staging/lustre/lustre/libcfs/hash.c
> @@ -106,9 +106,9 @@
>   *   Now we support both locked iteration & lockless iteration of hash
>   *   table. Also, user can break the iteration by return 1 in callback.
>   */
> +#include 
>  
>  #include "../../include/linux/libcfs/libcfs.h"
> -#include 

Again, not a "whitespace fix".


Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness

2015-10-28 Thread Dan Williams
On Thu, Oct 29, 2015 at 7:24 AM, Jeff Moyer  wrote:
> Ross Zwisler  writes:
>
>> This series implements the very slow but correct handling for
>> blkdev_issue_flush() with DAX mappings, as discussed here:
>>
>> https://lkml.org/lkml/2015/10/26/116
>>
>> I don't think that we can actually do the
>>
>> on_each_cpu(sync_cache, ...);
>>
>> ...where sync_cache is something like:
>>
>> cache_disable();
>> wbinvd();
>> pcommit();
>> cache_enable();
>>
>> solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
>> your writes actually make it durably onto the DIMMs.  I believe you really do
>> need to loop through the cache lines, flush them with CLWB, then fence and
>> PCOMMIT.
>
> *blink*
> *blink*
>
> So much for not violating the principle of least surprise.  I suppose
> you've asked the hardware folks, and they've sent you down this path?

The SDM states that wbinvd only asynchronously "signals" L3 to flush.

>> I do worry that the cost of blindly flushing the entire PMEM namespace on each
>> fsync or msync will be prohibitively expensive, and that we'll by very
>> incentivized to move to the radix tree based dirty page tracking as soon as
>> possible. :)
>
> Sure, but wbinvd would be quite costly as well.  Either way I think a
> better solution will be required in the near term.
>

As Peter points out the irqoff latency that wbinvd introduces also
makes it not optimal.


Re: [PATCH v2 0/4] x86: sigcontext fixes, again

2015-10-28 Thread Toshi Kani
On Wed, 2015-10-28 at 13:22 -0600, Toshi Kani wrote:
> On Wed, 2015-10-28 at 10:34 -0600, Toshi Kani wrote:
> > On Wed, 2015-10-28 at 12:53 +0300, Stas Sergeev wrote:
> > > 28.10.2015 03:04, Toshi Kani пишет:
> > > > On Wed, 2015-10-28 at 07:37 +0900, Linus Torvalds wrote:
> > > > > On Tue, Oct 27, 2015 at 11:05 PM, Stas Sergeev 
> > > > > wrote:
> > > > > > 
> > > > > > I can't easily post an Oops: under X it doesn't even appear -
> > > > > > machine freezes immediately, and under non-KMS console it is
> > > > > > possible to get one, but difficult to screen-shot (using bare
> > > > > > metal, not VM). Also the Oops was seemingly unrelated.
> > > > > > And if you run "dosemu -s" under non-KMS console, you'll also
> > > > > > reproduce this one:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=97321
> > > > > 
> > > > > Hmm. Andrew Morton responded to that initially, but then nothing
> > > > > happened, and now it's been another six months. Andrew?
> > > > > 
> > > > > The arch/x86/mm/pat.c error handling does seem to be suspect. This 
> > > > > is all code several years old, so none of this is new, and I think
> > > > > Suresh is gone.  Adding a few other people with recent sign-offs to 
> > > > > that file, in the hope that somebody feels like they own it..
> > > > 
> > > > In the case of PFNMAP, the range should always be mapped.  So, I 
> > > > wonder why follow_phys() failed with the !pte_present() check.
> > > > 
> > > > Stas, do you have a test program that can reproduce 97321?
> > > Get dosemu2 from here:
> > > https://github.com/stsp/dosemu2/releases
> > > or from git, or get dosemu1.
> > > Then boot your kernel with "nomodeset=1" to get a text console.
> > > Run
> > > 
> > > dosemu -s
> > > 
> > > and you'll get the bug.
> 
> I looked at the dosemu code and was able to reproduce the issue with a test
> program.  This problem happens when mremap() to /dev/mem (or PFNMAP) is
> called with MREMAP_FIXED.
> 
> In this case, mremap calls move_vma(), which first calls move_page_tables()
> to remap the translation and then calls do_munmap() to remove the original
> mapping.  Hence, when untrack_pfn() is called from do_munmap(), the
> original map is already removed, and follow_phys() fails with the
>  !pte_present() check.
> 
> I think there are a couple of issues:
>  - If untrack_pfn() ignores an error from follow_phys() and skips
> free_pfn_range(), PAT continues to track the original map that is removed.
>  - untrack_pfn() calls free_pfn_range() to untrack a given free range. 
>  However, rbt_memtype_erase() requires the free range match exactly to the
> tracked range.  This does not support mremap, which needs to free up part
> of the tracked range.
>  - PAT does not track a new translation specified by mremap() with MREMAP_F
> IXED.

Thinking further, I think the 1st and 3rd items are non-issues.  mremap remaps
virtual address, but keeps the same cache type and pfns.  So, PAT does not have
to change the tracked pfns in this case.  The 2nd item is still a problem,
though. 

Thanks,
-Toshi
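The failing sequence described above can be demonstrated from userspace. The sketch below substitutes an anonymous mapping for dosemu's /dev/mem PFNMAP mapping (which needs root), so it only shows the mremap(MREMAP_FIXED) pattern rather than triggering the PAT warning itself; remap_fixed_keeps_contents() is a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

static int remap_fixed_keeps_contents(void)
{
	size_t len = 4096;
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return 0;
	memset(src, 0xaa, len);

	/* reserve a fixed destination address */
	void *dst = mmap(NULL, len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		return 0;

	/* move_vma() first remaps the page tables to dst, then calls
	 * do_munmap() on the old range; with a PFNMAP vma (/dev/mem as
	 * source) untrack_pfn() runs after the ptes are gone, so
	 * follow_phys() fails its !pte_present() check. */
	void *moved = mremap(src, len, len,
			     MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (moved != dst)
		return 0;

	int ok = ((unsigned char *)moved)[0] == 0xaa;
	munmap(moved, len);
	return ok;
}
```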


[PATCH V5 4/9] device property: Adding DMA Attribute APIs for Generic Devices

2015-10-28 Thread Suravee Suthikulpanit
The function device_dma_is_coherent() does not sufficiently
communicate device DMA attributes. Instead, this patch introduces
device_get_dma_attr(), which returns enum dev_dma_attr.
It replaces acpi_check_dma(), which will be removed in a
subsequent patch.

This also provides a convenient function, device_dma_supported(),
to check DMA support of the specified device.

Signed-off-by: Suravee Suthikulpanit 
CC: Rafael J. Wysocki 
---
 drivers/base/property.c  | 29 +
 include/linux/property.h |  4 
 2 files changed, 33 insertions(+)

diff --git a/drivers/base/property.c b/drivers/base/property.c
index de40623..05d57a2 100644
--- a/drivers/base/property.c
+++ b/drivers/base/property.c
@@ -611,6 +611,35 @@ bool device_dma_is_coherent(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(device_dma_is_coherent);
 
+bool device_dma_supported(struct device *dev)
+{
+   /* For DT, this is always supported.
+* For ACPI, this depends on CCA, which
+* is determined by the acpi_dma_supported().
+*/
+   if (IS_ENABLED(CONFIG_OF) && dev->of_node)
+   return true;
+
+   return acpi_dma_supported(ACPI_COMPANION(dev));
+}
+EXPORT_SYMBOL_GPL(device_dma_supported);
+
+enum dev_dma_attr device_get_dma_attr(struct device *dev)
+{
+   enum dev_dma_attr attr = DEV_DMA_NOT_SUPPORTED;
+
+   if (IS_ENABLED(CONFIG_OF) && dev->of_node) {
+   if (of_dma_is_coherent(dev->of_node))
+   attr = DEV_DMA_COHERENT;
+   else
+   attr = DEV_DMA_NON_COHERENT;
+   } else
+   attr = acpi_get_dma_attr(ACPI_COMPANION(dev));
+
+   return attr;
+}
+EXPORT_SYMBOL_GPL(device_get_dma_attr);
+
 /**
  * device_get_phy_mode - Get phy mode for given device
  * @dev:   Pointer to the given device
diff --git a/include/linux/property.h b/include/linux/property.h
index 8eecf20..7200490 100644
--- a/include/linux/property.h
+++ b/include/linux/property.h
@@ -176,6 +176,10 @@ void device_add_property_set(struct device *dev, struct property_set *pset);
 
 bool device_dma_is_coherent(struct device *dev);
 
+bool device_dma_supported(struct device *dev);
+
+enum dev_dma_attr device_get_dma_attr(struct device *dev);
+
 int device_get_phy_mode(struct device *dev);
 
 void *device_get_mac_address(struct device *dev, char *addr, int alen);
-- 
2.1.0
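The decision table implemented by device_get_dma_attr() together with acpi_get_dma_attr() from the previous patch can be mirrored in a standalone sketch; get_dma_attr_sketch() and its boolean parameters are hypothetical reductions of the OF/ACPI firmware inputs, not kernel names:

```c
#include <assert.h>

enum dma_attr_sketch { DMA_NOT_SUPPORTED, DMA_NON_COHERENT, DMA_COHERENT };

static enum dma_attr_sketch get_dma_attr_sketch(int has_of_node,
						int of_coherent,
						int acpi_supported,
						int acpi_coherent)
{
	/* OF path: DMA is always supported, of_dma_is_coherent() picks
	 * between the two supported states */
	if (has_of_node)
		return of_coherent ? DMA_COHERENT : DMA_NON_COHERENT;
	/* ACPI path: acpi_dma_supported() gates support, then
	 * flags.coherent_dma picks the attribute */
	if (!acpi_supported)
		return DMA_NOT_SUPPORTED;
	return acpi_coherent ? DMA_COHERENT : DMA_NON_COHERENT;
}
```

The three-valued result is what lets callers distinguish "no DMA at all" from "non-coherent DMA", which the old boolean device_dma_is_coherent() could not express.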



[PATCH V5 1/9] ACPI: Honor ACPI _CCA attribute setting

2015-10-28 Thread Suravee Suthikulpanit
From: Jeremy Linton 

ACPI configurations can now mark devices as noncoherent,
support that choice.

NOTE: This is required to support USB on ARM Juno Development Board.

Signed-off-by: Jeremy Linton 
Signed-off-by: Suravee Suthikulpanit 
CC: Bjorn Helgaas 
CC: Catalin Marinas 
CC: Rob Herring 
CC: Will Deacon 
CC: Rafael J. Wysocki 
---
 include/acpi/acpi_bus.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index d11eff8..0f131d2 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -407,7 +407,7 @@ static inline bool acpi_check_dma(struct acpi_device *adev, bool *coherent)
 * case 1. Do not support and disable DMA.
 * case 2. Support but rely on arch-specific cache maintenance for
 * non-coherence DMA operations.
-* Currently, we implement case 1 above.
+* Currently, we implement case 2 above.
 *
 * For the case when _CCA is missing (i.e. cca_seen=0) and
 * platform specifies ACPI_CCA_REQUIRED, we do not support DMA,
@@ -415,7 +415,8 @@ static inline bool acpi_check_dma(struct acpi_device *adev, bool *coherent)
 *
 * See acpi_init_coherency() for more info.
 */
-   if (adev->flags.coherent_dma) {
+   if (adev->flags.coherent_dma ||
+   (adev->flags.cca_seen && IS_ENABLED(CONFIG_ARM64))) {
ret = true;
if (coherent)
*coherent = adev->flags.coherent_dma;
-- 
2.1.0



[PATCH V5 0/9] PCI: ACPI: Setting up DMA coherency for PCI device from _CCA attribute

2015-10-28 Thread Suravee Suthikulpanit
This patch series adds support to setup DMA coherency for PCI device using
the ACPI _CCA attribute. According to the ACPI spec, the _CCA attribute
is required for ARM64. Therefore, this patch is a pre-req for ACPI PCI
support for ARM64 which is currently in development.  Also, this should
not affect other architectures that do not define
CONFIG_ACPI_CCA_REQUIRED, since the default value is coherent.

In the process, this series also introduces enum dev_dma_attr and a set
of APIs to query device DMA attribute. These APIs replace the obsolete
device_dma_is_coherent(), and acpi_check_dma().

I have also included a patch from Jeremy posted here:
http://www.spinics.net/lists/linux-usb/msg128582.html

This patch series is now rebased from:
https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git 
linux-next

This patch series has been tested on AMD Seattle RevB platform.
The git tree containing tested code and pre-req patches is available here:
http://github.com/ssuthiku/linux.git pci-cca-v5

Changes from V4: (https://lkml.org/lkml/2015/10/21/535)
* Clean up from Hanjun, Bjorn, and Thomas review comments
* Rebased on top of Rafael's latest linux-next branch
* Added patch 7/9 to fix the  pci_get_host_bridge_device leak
* Added Acked-by from Bjorn
* Added Reviewed-by from Hanjun

Changes from V3: (https://lkml.org/lkml/2015/8/26/389)
* Clean up suggested by Bjorn
* Introduce enum dev_dma_attr
* Replace device_dma_is_coherent() and acpi_check_dma() with
  new APIs.

Changes from V2: (https://lkml.org/lkml/2015/8/25/549)
* Return -ENOSUPP instead of -1 (per Rafael's suggestion)
* Add WARN() when fail to setup DMA for PCI device when booting
  ACPI (per Arnd's suggestion)
* Added Acked-by from Rob.
* Minor clean up

Changes from V1: (https://lkml.org/lkml/2015/8/13/182)
* Include patch 1 from Jeremy to enable support for _CCA=0
* Clean up acpi_check_dma() per Bjorn suggestions
* Split the original V1 patch into two patches (patch 3 and 4)

Jeremy Linton (1):
  ACPI: Honor ACPI _CCA attribute setting

Suravee Suthikulpanit (8):
  device property: Introducing enum dev_dma_attr
  ACPI: Adding DMA Attribute APIs for ACPI Device
  device property: Adding DMA Attribute APIs for Generic Devices
  device property: ACPI: Make use of the new DMA Attribute APIs
  device property: ACPI: Remove unused DMA APIs
  of/pci: Fix pci_get_host_bridge_device leak
  PCI: OF: Move of_pci_dma_configure() to pci_dma_configure()
  PCI: ACPI: Add support for PCI device DMA coherency

 drivers/acpi/acpi_platform.c  |  7 +-
 drivers/acpi/glue.c   |  8 +++---
 drivers/acpi/scan.c   | 42 +++
 drivers/base/property.c   | 32 +--
 drivers/crypto/ccp/ccp-platform.c | 15 ---
 drivers/net/ethernet/amd/xgbe/xgbe-main.c |  8 +-
 drivers/of/of_pci.c   | 20 ---
 drivers/pci/probe.c   | 33 ++--
 include/acpi/acpi_bus.h   | 36 +++---
 include/linux/acpi.h  |  7 +-
 include/linux/of_pci.h|  3 ---
 include/linux/property.h  | 10 +++-
 12 files changed, 144 insertions(+), 77 deletions(-)

-- 
2.1.0



[net-next PATCH] RDS: convert bind hash table to re-sizable hashtable

2015-10-28 Thread Santosh Shilimkar
To further improve the RDS connection scalability on massive systems
where the number of sockets grows into tens of thousands, there is a
need for a larger bind hashtable. A pre-allocated 8K or 16K table is
not very flexible in terms of memory utilisation. The rhashtable
infrastructure gives us the flexibility to grow the hashtable based
on use and also comes with efficient inbuilt bucket (chain) handling.

Reviewed-by: David Miller 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c |  10 -
 net/rds/bind.c   | 126 +++
 net/rds/rds.h|   7 +++-
 3 files changed, 57 insertions(+), 86 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 384ea1e..b5476aeb 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -573,6 +573,7 @@ static void rds_exit(void)
rds_threads_exit();
rds_stats_exit();
rds_page_exit();
+   rds_bind_lock_destroy();
rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
 }
@@ -582,11 +583,14 @@ static int rds_init(void)
 {
int ret;
 
-   rds_bind_lock_init();
+   ret = rds_bind_lock_init();
+   if (ret)
+   goto out;
 
ret = rds_conn_init();
if (ret)
-   goto out;
+   goto out_bind;
+
ret = rds_threads_init();
if (ret)
goto out_conn;
@@ -620,6 +624,8 @@ out_conn:
rds_conn_exit();
rds_cong_exit();
rds_page_exit();
+out_bind:
+   rds_bind_lock_destroy();
 out:
return ret;
 }
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 6192566..2b00222 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -38,54 +38,17 @@
 #include 
 #include "rds.h"
 
-struct bind_bucket {
-   rwlock_tlock;
-   struct hlist_head   head;
+static struct rhashtable bind_hash_table;
+
+static struct rhashtable_params ht_parms = {
+   .nelem_hint = 768,
+   .key_len = sizeof(u64),
+   .key_offset = offsetof(struct rds_sock, rs_bound_key),
+   .head_offset = offsetof(struct rds_sock, rs_bound_node),
+   .max_size = 16384,
+   .min_size = 1024,
 };
 
-#define BIND_HASH_SIZE 1024
-static struct bind_bucket bind_hash_table[BIND_HASH_SIZE];
-
-static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port)
-{
-   return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) &
- (BIND_HASH_SIZE - 1));
-}
-
-/* must hold either read or write lock (write lock for insert != NULL) */
-static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket,
-   __be32 addr, __be16 port,
-   struct rds_sock *insert)
-{
-   struct rds_sock *rs;
-   struct hlist_head *head = &bucket->head;
-   u64 cmp;
-   u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
-
-   hlist_for_each_entry(rs, head, rs_bound_node) {
-   cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
- be16_to_cpu(rs->rs_bound_port);
-
-   if (cmp == needle) {
-   rds_sock_addref(rs);
-   return rs;
-   }
-   }
-
-   if (insert) {
-   /*
-* make sure our addr and port are set before
-* we are added to the list.
-*/
-   insert->rs_bound_addr = addr;
-   insert->rs_bound_port = port;
-   rds_sock_addref(insert);
-
-   hlist_add_head(&insert->rs_bound_node, head);
-   }
-   return NULL;
-}
-
 /*
  * Return the rds_sock bound at the given local address.
  *
@@ -94,18 +94,14 @@ static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket,
  */
 struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
+   u64 key = ((u64)addr << 32) | port;
struct rds_sock *rs;
-   unsigned long flags;
-   struct bind_bucket *bucket = hash_to_bucket(addr, port);
 
-   read_lock_irqsave(&bucket->lock, flags);
-   rs = rds_bind_lookup(bucket, addr, port, NULL);
-   read_unlock_irqrestore(&bucket->lock, flags);
-
-   if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
-   rds_sock_put(rs);
+   rs = rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms);
+   if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
+   rds_sock_addref(rs);
+   else
rs = NULL;
-   }
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, &addr,
ntohs(port));
@@ -116,10 +75,9 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 /* returns -ve errno or +ve port */
 static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
 {
-   unsigned long flags;
int ret = -EADDRINUSE;
u16 rover, last;
-
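The rhashtable key used by rds_find_bound() above packs the address into the high 32 bits and the port into the low bits; rds_bind_key() is a hypothetical standalone version of that expression, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Pack an IPv4 address and port into the 64-bit bind-hash key, mirroring
 * "((u64)addr << 32) | port" from rds_find_bound(). */
static uint64_t rds_bind_key(uint32_t addr, uint16_t port)
{
	return ((uint64_t)addr << 32) | port;
}
```

Lookup and insert must build the key identically; the patch packs the big-endian values directly (no be32_to_cpu() as in the old rds_bind_lookup()), which stays consistent as long as both sides do the same.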

[PATCH V5 3/9] ACPI: Adding DMA Attribute APIs for ACPI Device

2015-10-28 Thread Suravee Suthikulpanit
Adding acpi_get_dma_attr() to query DMA attributes of ACPI devices.
It returns the enum dev_dma_attr, which communicates DMA information
more clearly. This API replaces acpi_check_dma(), which will be
removed in a subsequent patch.

This patch also provides a convenient function, acpi_dma_supported(),
to check DMA support of the specified ACPI device.

Signed-off-by: Suravee Suthikulpanit 
CC: Rafael J. Wysocki 
---
 drivers/acpi/scan.c | 42 ++
 include/acpi/acpi_bus.h |  3 +++
 include/linux/acpi.h| 10 ++
 3 files changed, 55 insertions(+)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index daf9fc8..78d5f02 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1308,6 +1308,48 @@ void acpi_free_pnp_ids(struct acpi_device_pnp *pnp)
kfree(pnp->unique_id);
 }
 
+/**
+ * acpi_dma_supported - Check DMA support for the specified device.
+ * @adev: The pointer to acpi device
+ *
+ * Return false if DMA is not supported. Otherwise, return true
+ */
+bool acpi_dma_supported(struct acpi_device *adev)
+{
+   if (!adev)
+   return false;
+
+   if (adev->flags.cca_seen)
+   return true;
+
+   /*
+   * Per ACPI 6.0 sec 6.2.17, assume devices can do cache-coherent
+   * DMA on "Intel platforms".  Presumably that includes all x86 and
+   * ia64, and other arches will set CONFIG_ACPI_CCA_REQUIRED=y.
+   */
+   if (!IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED))
+   return true;
+
+   return false;
+}
+
+/**
+ * acpi_get_dma_attr - Check the supported DMA attr for the specified device.
+ * @adev: The pointer to acpi device
+ *
+ * Return enum dev_dma_attr.
+ */
+enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev)
+{
+   if (!acpi_dma_supported(adev))
+   return DEV_DMA_NOT_SUPPORTED;
+
+   if (adev->flags.coherent_dma)
+   return DEV_DMA_COHERENT;
+   else
+   return DEV_DMA_NON_COHERENT;
+}
+
 static void acpi_init_coherency(struct acpi_device *adev)
 {
unsigned long long cca = 0;
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 0f131d2..920b774 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -596,6 +596,9 @@ struct acpi_pci_root {
 
 /* helper */
 
+bool acpi_dma_supported(struct acpi_device *adev);
+enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev);
+
 struct acpi_device *acpi_find_child_device(struct acpi_device *parent,
   u64 address, bool check_children);
 int acpi_is_root_bridge(acpi_handle);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 82f56bb..6527920 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -598,6 +598,16 @@ static inline bool acpi_check_dma(struct acpi_device *adev, bool *coherent)
return false;
 }
 
+static inline bool acpi_dma_supported(struct acpi_device *adev)
+{
+   return false;
+}
+
+static inline enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev)
+{
+   return DEV_DMA_NOT_SUPPORTED;
+}
+
 #define ACPI_PTR(_ptr) (NULL)
 
 #endif /* !CONFIG_ACPI */
-- 
2.1.0



[PATCH V5 5/9] device property: ACPI: Make use of the new DMA Attribute APIs

2015-10-28 Thread Suravee Suthikulpanit
Now that we have the new DMA attribute APIs, we can replace the older
acpi_check_dma() and device_dma_is_coherent().

Signed-off-by: Suravee Suthikulpanit 
CC: Rafael J. Wysocki 
CC: Tom Lendacky 
CC: Herbert Xu 
CC: David S. Miller 
---
 drivers/acpi/acpi_platform.c  |  7 ++-
 drivers/acpi/glue.c   |  8 +---
 drivers/crypto/ccp/ccp-platform.c | 15 +++
 drivers/net/ethernet/amd/xgbe/xgbe-main.c |  8 +++-
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/acpi/acpi_platform.c b/drivers/acpi/acpi_platform.c
index 06a67d5..296b7a1 100644
--- a/drivers/acpi/acpi_platform.c
+++ b/drivers/acpi/acpi_platform.c
@@ -103,7 +103,12 @@ struct platform_device *acpi_create_platform_device(struct acpi_device *adev)
pdevinfo.res = resources;
pdevinfo.num_res = count;
pdevinfo.fwnode = acpi_fwnode_handle(adev);
-   pdevinfo.dma_mask = acpi_check_dma(adev, NULL) ? DMA_BIT_MASK(32) : 0;
+
+   if (acpi_dma_supported(adev))
+   pdevinfo.dma_mask = DMA_BIT_MASK(32);
+   else
+   pdevinfo.dma_mask = 0;
+
pdev = platform_device_register_full(&pdevinfo);
if (IS_ERR(pdev))
dev_err(&adev->dev, "platform device creation failed: %ld\n",
diff --git a/drivers/acpi/glue.c b/drivers/acpi/glue.c
index 1470ae4..5ea5dc2 100644
--- a/drivers/acpi/glue.c
+++ b/drivers/acpi/glue.c
@@ -168,7 +168,7 @@ int acpi_bind_one(struct device *dev, struct acpi_device *acpi_dev)
struct list_head *physnode_list;
unsigned int node_id;
int retval = -EINVAL;
-   bool coherent;
+   enum dev_dma_attr attr;
 
if (has_acpi_companion(dev)) {
if (acpi_dev) {
@@ -225,8 +225,10 @@ int acpi_bind_one(struct device *dev, struct acpi_device *acpi_dev)
if (!has_acpi_companion(dev))
ACPI_COMPANION_SET(dev, acpi_dev);
 
-   if (acpi_check_dma(acpi_dev, &coherent))
-   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
+   attr = acpi_get_dma_attr(acpi_dev);
+   if (attr != DEV_DMA_NOT_SUPPORTED)
+   arch_setup_dma_ops(dev, 0, 0, NULL,
+  attr == DEV_DMA_COHERENT);
 
acpi_physnode_link_name(physical_node_name, node_id);
retval = sysfs_create_link(&acpi_dev->dev.kobj, &dev->kobj,
diff --git a/drivers/crypto/ccp/ccp-platform.c b/drivers/crypto/ccp/ccp-platform.c
index bb241c3..844118c 100644
--- a/drivers/crypto/ccp/ccp-platform.c
+++ b/drivers/crypto/ccp/ccp-platform.c
@@ -96,6 +96,7 @@ static int ccp_platform_probe(struct platform_device *pdev)
struct ccp_platform *ccp_platform;
struct device *dev = &pdev->dev;
struct acpi_device *adev = ACPI_COMPANION(dev);
+   enum dev_dma_attr attr;
struct resource *ior;
int ret;
 
@@ -122,18 +123,24 @@ static int ccp_platform_probe(struct platform_device *pdev)
}
ccp->io_regs = ccp->io_map;
 
-   ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48));
-   if (ret) {
-   dev_err(dev, "dma_set_mask_and_coherent failed (%d)\n", ret);
+   attr = device_get_dma_attr(dev);
+   if (attr == DEV_DMA_NOT_SUPPORTED) {
+   dev_err(dev, "DMA is not supported");
goto e_err;
}
 
-   ccp_platform->coherent = device_dma_is_coherent(ccp->dev);
+   ccp_platform->coherent = (attr == DEV_DMA_COHERENT);
if (ccp_platform->coherent)
ccp->axcache = CACHE_WB_NO_ALLOC;
else
ccp->axcache = CACHE_NONE;
 
+   ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48));
+   if (ret) {
+   dev_err(dev, "dma_set_mask_and_coherent failed (%d)\n", ret);
+   goto e_err;
+   }
+
dev_set_drvdata(dev, ccp);
 
ret = ccp_init(ccp);
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-main.c b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
index e83bd76..c607b3f 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-main.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
@@ -342,6 +342,7 @@ static int xgbe_probe(struct platform_device *pdev)
struct resource *res;
const char *phy_mode;
unsigned int i, phy_memnum, phy_irqnum;
+   enum dev_dma_attr attr;
int ret;
 
DBGPR("--> xgbe_probe\n");
@@ -609,7 +610,12 @@ static int xgbe_probe(struct platform_device *pdev)
goto err_io;
 
/* Set the DMA coherency values */
-   pdata->coherent = device_dma_is_coherent(pdata->dev);
+   attr = device_get_dma_attr(dev);
+   if (attr == DEV_DMA_NOT_SUPPORTED) {
+   dev_err(dev, "DMA is not supported");
+   goto err_io;
+   }
+   pdata->coherent = (attr == DEV_DMA_COHERENT);
if (pdata->coherent) {
pdata->axdomain = XGBE_DMA_OS_AXDOMAIN;
pdata->arcache = XGBE_DMA_OS_ARCACHE;
-- 
2.1.0


[PATCH V5 8/9] PCI: OF: Move of_pci_dma_configure() to pci_dma_configure()

2015-10-28 Thread Suravee Suthikulpanit
This patch moves of_pci_dma_configure() to a more generic
pci_dma_configure(), which can be extended by non-OF code (e.g. ACPI).

This has no functional change.

Signed-off-by: Suravee Suthikulpanit 
Acked-by: Rob Herring 
Acked-by: Bjorn Helgaas 
Reviewed-by: Hanjun Guo 
CC: Rafael J. Wysocki 
---
 drivers/of/of_pci.c| 19 ---
 drivers/pci/probe.c| 23 +--
 include/linux/of_pci.h |  3 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index a2f510c..b66ee4e 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -117,25 +117,6 @@ int of_get_pci_domain_nr(struct device_node *node)
 }
 EXPORT_SYMBOL_GPL(of_get_pci_domain_nr);
 
-/**
- * of_pci_dma_configure - Setup DMA configuration
- * @dev: ptr to pci_dev struct of the PCI device
- *
- * Function to update PCI devices's DMA configuration using the same
- * info from the OF node of host bridge's parent (if any).
- */
-void of_pci_dma_configure(struct pci_dev *pci_dev)
-{
-   struct device *dev = &pci_dev->dev;
-   struct device *bridge = pci_get_host_bridge_device(pci_dev);
-
-   if (bridge->parent)
-   of_dma_configure(dev, bridge->parent->of_node);
-
-   pci_put_host_bridge_device(bridge);
-}
-EXPORT_SYMBOL_GPL(of_pci_dma_configure);
-
 #if defined(CONFIG_OF_ADDRESS)
 /**
  * of_pci_get_host_bridge_resources - Parse PCI host bridge resources from DT
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8361d27..31e3eef 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -6,7 +6,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -1633,6 +1633,25 @@ static void pci_set_msi_domain(struct pci_dev *dev)
   dev_get_msi_domain(&dev->bus->dev));
 }
 
+/**
+ * pci_dma_configure - Setup DMA configuration
+ * @dev: ptr to pci_dev struct of the PCI device
+ *
+ * Function to update PCI devices's DMA configuration using the same
+ * info from the OF node of host bridge's parent (if any).
+ */
+static void pci_dma_configure(struct pci_dev *dev)
+{
+   struct device *bridge = pci_get_host_bridge_device(dev);
+
+   if (IS_ENABLED(CONFIG_OF) && dev->dev.of_node) {
+   if (bridge->parent)
+   of_dma_configure(&dev->dev, bridge->parent->of_node);
+   }
+
+   pci_put_host_bridge_device(bridge);
+}
+
 void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 {
int ret;
@@ -1646,7 +1665,7 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus 
*bus)
dev->dev.dma_mask = &dev->dma_mask;
dev->dev.dma_parms = &dev->dma_parms;
	dev->dev.coherent_dma_mask = 0xffffffffull;
-   of_pci_dma_configure(dev);
+   pci_dma_configure(dev);
 
pci_set_dma_max_seg_size(dev, 65536);
	pci_set_dma_seg_boundary(dev, 0xffffffff);
diff --git a/include/linux/of_pci.h b/include/linux/of_pci.h
index 29fd3fe..ce0e5ab 100644
--- a/include/linux/of_pci.h
+++ b/include/linux/of_pci.h
@@ -16,7 +16,6 @@ int of_pci_get_devfn(struct device_node *np);
 int of_irq_parse_and_map_pci(const struct pci_dev *dev, u8 slot, u8 pin);
 int of_pci_parse_bus_range(struct device_node *node, struct resource *res);
 int of_get_pci_domain_nr(struct device_node *node);
-void of_pci_dma_configure(struct pci_dev *pci_dev);
 #else
 static inline int of_irq_parse_pci(const struct pci_dev *pdev, struct 
of_phandle_args *out_irq)
 {
@@ -51,8 +50,6 @@ of_get_pci_domain_nr(struct device_node *node)
 {
return -1;
 }
-
-static inline void of_pci_dma_configure(struct pci_dev *pci_dev) { }
 #endif
 
 #if defined(CONFIG_OF_ADDRESS)
-- 
2.1.0



Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
On Wed, Oct 28, 2015 at 9:12 AM, Michael S. Tsirkin  wrote:
> On Wed, Oct 28, 2015 at 11:32:34PM +0900, David Woodhouse wrote:
>> > I don't have a problem with extending DMA API to address
>> > more usecases.
>>
>> No, this isn't an extension. This is fixing a bug, on certain platforms
>> where the DMA API has currently done the wrong thing.
>>
>> We have historically worked around that bug by introducing *another*
>> bug, which is not to *use* the DMA API in the virtio driver.
>>
>> Sure, we can co-ordinate those two bug-fixes. But let's not talk about
>> them as anything other than bug-fixes.
>
> It was pretty practical not to use it. All virtio devices at the time
> without exception bypassed the IOMMU, so it was a question of omitting a
> couple of function calls in virtio versus hacking on DMA implementation
> on multiple platforms. We have more policy options now, so I agree it's
> time to revisit this.
>
> But for me, the most important thing is that we do coordinate.
>
>> > > Drivers use DMA API. No more talky.
>> >
>> > Well for virtio they don't ATM. And 1:1 mapping makes perfect sense
>> > for the vast majority of users, so I can't switch them over
>> > until the DMA API actually addresses all existing usecases.
>>
>> That's still not your business; it's the platform's. And there are
>> hardware implementations of the virtio protocols on real PCI cards. And
>> we have the option of doing IOMMU translation for the virtio devices
>> even in a virtual machine. Just don't get involved.
>>
>> --
>> dwmw2
>>
>>
>
> I'm involved anyway, it's possible not to put all the code in the virtio
> subsystem in guest though.  But I suspect we'll need to find a way for
> non-linux drivers within guest to work correctly too, and they might
> have trouble poking at things at the system level.  So possibly virtio
> subsystem will have to tell platform "this device wants to bypass IOMMU"
> and then DMA API does the right thing.
>

After some discussion at KS, no one came up with an example where it's
necessary, and the patches to convert virtqueue to use the DMA API are
much nicer when they convert it unconditionally.

The two interesting cases we thought of were PPC and x86's emulated
Q35 IOMMU.  PPC will look into architecting a devicetree-based way to
indicate passthrough status and will add quirks for the existing
virtio devices.  Everyone seems to agree that x86's emulated Q35 thing
is just buggy right now and should be taught to use the existing ACPI
mechanism for enumerating passthrough devices.

I'll send a new version of the series soon.

--Andy


[PATCH V5 7/9] of/pci: Fix pci_get_host_bridge_device leak

2015-10-28 Thread Suravee Suthikulpanit
In case of error, the current code returns without calling
pci_put_host_bridge_device(). This patch fixes that.

Signed-off-by: Suravee Suthikulpanit 
Acked-by: Bjorn Helgaas 
---
 drivers/of/of_pci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index 5751dc5..a2f510c 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -129,10 +129,9 @@ void of_pci_dma_configure(struct pci_dev *pci_dev)
struct device *dev = &pci_dev->dev;
struct device *bridge = pci_get_host_bridge_device(pci_dev);
 
-   if (!bridge->parent)
-   return;
+   if (bridge->parent)
+   of_dma_configure(dev, bridge->parent->of_node);
 
-   of_dma_configure(dev, bridge->parent->of_node);
pci_put_host_bridge_device(bridge);
 }
 EXPORT_SYMBOL_GPL(of_pci_dma_configure);
-- 
2.1.0



Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness

2015-10-28 Thread Ross Zwisler
On Wed, Oct 28, 2015 at 06:24:29PM -0400, Jeff Moyer wrote:
> Ross Zwisler  writes:
> 
> > This series implements the very slow but correct handling for
> > blkdev_issue_flush() with DAX mappings, as discussed here:
> >
> > https://lkml.org/lkml/2015/10/26/116
> >
> > I don't think that we can actually do the
> >
> > on_each_cpu(sync_cache, ...);
> >
> > ...where sync_cache is something like:
> >
> > cache_disable();
> > wbinvd();
> > pcommit();
> > cache_enable();
> >
> > solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
> > your writes actually make it durably onto the DIMMs.  I believe you really 
> > do
> > need to loop through the cache lines, flush them with CLWB, then fence and
> > PCOMMIT.
> 
> *blink*
> *blink*
> 
> So much for not violating the principal of least surprise.  I suppose
> you've asked the hardware folks, and they've sent you down this path?

Sadly, yes, this was the guidance from the hardware folks.


[PATCH V5 6/9] device property: ACPI: Remove unused DMA APIs

2015-10-28 Thread Suravee Suthikulpanit
These DMA APIs are replaced with the newer versions, which return
the enum dev_dma_attr. So, we can safely remove them.

Signed-off-by: Suravee Suthikulpanit 
CC: Rafael J. Wysocki 
---
 drivers/base/property.c  | 13 -
 include/acpi/acpi_bus.h  | 34 --
 include/linux/acpi.h |  5 -
 include/linux/property.h |  2 --
 4 files changed, 54 deletions(-)

diff --git a/drivers/base/property.c b/drivers/base/property.c
index 05d57a2..1325ff2 100644
--- a/drivers/base/property.c
+++ b/drivers/base/property.c
@@ -598,19 +598,6 @@ unsigned int device_get_child_node_count(struct device 
*dev)
 }
 EXPORT_SYMBOL_GPL(device_get_child_node_count);
 
-bool device_dma_is_coherent(struct device *dev)
-{
-   bool coherent = false;
-
-   if (IS_ENABLED(CONFIG_OF) && dev->of_node)
-   coherent = of_dma_is_coherent(dev->of_node);
-   else
-   acpi_check_dma(ACPI_COMPANION(dev), &coherent);
-
-   return coherent;
-}
-EXPORT_SYMBOL_GPL(device_dma_is_coherent);
-
 bool device_dma_supported(struct device *dev)
 {
/* For DT, this is always supported.
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 920b774..ad0a5ff 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -390,40 +390,6 @@ struct acpi_data_node {
struct completion kobj_done;
 };
 
-static inline bool acpi_check_dma(struct acpi_device *adev, bool *coherent)
-{
-   bool ret = false;
-
-   if (!adev)
-   return ret;
-
-   /**
-* Currently, we only support _CCA=1 (i.e. coherent_dma=1)
-* This should be equivalent to specifyig dma-coherent for
-* a device in OF.
-*
-* For the case when _CCA=0 (i.e. coherent_dma=0 && cca_seen=1),
-* There are two cases:
-* case 1. Do not support and disable DMA.
-* case 2. Support but rely on arch-specific cache maintenance for
-* non-coherence DMA operations.
-* Currently, we implement case 2 above.
-*
-* For the case when _CCA is missing (i.e. cca_seen=0) and
-* platform specifies ACPI_CCA_REQUIRED, we do not support DMA,
-* and fallback to arch-specific default handling.
-*
-* See acpi_init_coherency() for more info.
-*/
-   if (adev->flags.coherent_dma ||
-   (adev->flags.cca_seen && IS_ENABLED(CONFIG_ARM64))) {
-   ret = true;
-   if (coherent)
-   *coherent = adev->flags.coherent_dma;
-   }
-   return ret;
-}
-
 static inline bool is_acpi_node(struct fwnode_handle *fwnode)
 {
return fwnode && (fwnode->type == FWNODE_ACPI
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 6527920..fa2bbc0 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -593,11 +593,6 @@ static inline int acpi_device_modalias(struct device *dev,
return -ENODEV;
 }
 
-static inline bool acpi_check_dma(struct acpi_device *adev, bool *coherent)
-{
-   return false;
-}
-
 static inline bool acpi_dma_supported(struct acpi_device *adev)
 {
return false;
diff --git a/include/linux/property.h b/include/linux/property.h
index 7200490..0a3705a 100644
--- a/include/linux/property.h
+++ b/include/linux/property.h
@@ -174,8 +174,6 @@ struct property_set {
 
 void device_add_property_set(struct device *dev, struct property_set *pset);
 
-bool device_dma_is_coherent(struct device *dev);
-
 bool device_dma_supported(struct device *dev);
 
 enum dev_dma_attr device_get_dma_attr(struct device *dev);
-- 
2.1.0



[PATCH V5 2/9] device property: Introducing enum dev_dma_attr

2015-10-28 Thread Suravee Suthikulpanit
A device could have one of the following DMA attributes:
* DMA not supported
* DMA non-coherent
* DMA coherent

So, this patch introduces enum dev_dma_attr, which will be used by
new APIs introduced in later patches.

Signed-off-by: Suravee Suthikulpanit 
CC: Rafael J. Wysocki 
CC: Bjorn Helgaas 
---
 include/linux/property.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/property.h b/include/linux/property.h
index 463de52..8eecf20 100644
--- a/include/linux/property.h
+++ b/include/linux/property.h
@@ -27,6 +27,12 @@ enum dev_prop_type {
DEV_PROP_MAX,
 };
 
+enum dev_dma_attr {
+   DEV_DMA_NOT_SUPPORTED,
+   DEV_DMA_NON_COHERENT,
+   DEV_DMA_COHERENT,
+};
+
 bool device_property_present(struct device *dev, const char *propname);
 int device_property_read_u8_array(struct device *dev, const char *propname,
  u8 *val, size_t nval);
-- 
2.1.0



[PATCH V5 9/9] PCI: ACPI: Add support for PCI device DMA coherency

2015-10-28 Thread Suravee Suthikulpanit
This patch adds support for setting up PCI device DMA coherency from
the ACPI _CCA object, which should normally be specified in the DSDT
node of the device's PCI host bridge.

Signed-off-by: Suravee Suthikulpanit 
Acked-by: Bjorn Helgaas 
Reviewed-by: Hanjun Guo 
CC: Rafael J. Wysocki 
---
 drivers/pci/probe.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 31e3eef..40eed54 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "pci.h"
 
@@ -1638,7 +1639,7 @@ static void pci_set_msi_domain(struct pci_dev *dev)
  * @dev: ptr to pci_dev struct of the PCI device
  *
  * Function to update PCI devices's DMA configuration using the same
- * info from the OF node of host bridge's parent (if any).
+ * info from the OF node or ACPI node of host bridge's parent (if any).
  */
 static void pci_dma_configure(struct pci_dev *dev)
 {
@@ -1647,6 +1648,15 @@ static void pci_dma_configure(struct pci_dev *dev)
if (IS_ENABLED(CONFIG_OF) && dev->dev.of_node) {
if (bridge->parent)
of_dma_configure(&dev->dev, bridge->parent->of_node);
+   } else if (has_acpi_companion(bridge)) {
+   struct acpi_device *adev = to_acpi_device_node(bridge->fwnode);
+   enum dev_dma_attr attr = acpi_get_dma_attr(adev);
+
+   if (attr == DEV_DMA_NOT_SUPPORTED)
+   dev_warn(&dev->dev, "DMA not supported.\n");
+   else
+   arch_setup_dma_ops(&dev->dev, 0, 0, NULL,
+  attr == DEV_DMA_COHERENT);
}
 
pci_put_host_bridge_device(bridge);
-- 
2.1.0



[PATCH] __div64_32: implement division by multiplication for 32-bit arches

2015-10-28 Thread Alexey Brodkin
The existing default implementation of __div64_32() for 32-bit arches
unfolds into a huge routine with lots of arithmetic (+, -, *), all of it
in loops. That leads to obvious performance degradation when do_div() is
used frequently.

A good example is heavy TCP/IP traffic.
Here is what I get from perf during an iperf3 run:
 -->8--
30.05%  iperf3   [kernel.kallsyms][k] copy_from_iter
11.77%  iperf3   [kernel.kallsyms][k] __div64_32
 5.44%  iperf3   [kernel.kallsyms][k] memset
 5.32%  iperf3   [kernel.kallsyms][k] stmmac_xmit
 2.70%  iperf3   [kernel.kallsyms][k] skb_segment
 2.56%  iperf3   [kernel.kallsyms][k] tcp_ack
 -->8--

do_div() here is mostly used in skb_mstamp_get() to convert nanoseconds
returned by local_clock() to the microseconds used in the timestamp.
The conversion itself is as simple as "/= 1000".

Fortunately we already have a much better __div64_32() for 32-bit ARM.
There, in the case of division by a constant, the compiler computes a
so-called "magic number" at build time, which is then used in
multiplications instead of divisions. It's really nice and close to
optimal, but it obviously works only on ARM because ARM assembly is
involved.

So why not extend the same approach to all other 32-bit arches, with
the multiplication part implemented in pure C? With a good compiler the
resulting assembly will be quite close to hand-written assembly.

This patch implements exactly that.

But there's at least one problem which I don't know how to solve.
The constant-divisor magic only happens if __div64_32() is inlined
(obviously so - the compiler has to know whether the divisor is
constant).

But __div64_32() is already marked as a weak function (which in turn is
required to allow some architectures to provide their own optimal
implementations), so adding "inline" to __div64_32() is not an option.

So I'd like to hear opinions on how to proceed with this patch.
The simplest solution is to use this implementation only in my
architecture of choice (read: ARC), but IMHO this change may benefit
other architectures as well.

Signed-off-by: Alexey Brodkin 
Cc: linux-snps-...@lists.infradead.org
Cc: Vineet Gupta 
Cc: Ingo Molnar 
Cc: Stephen Hemminger 
Cc: David S. Miller 
Cc: Nicolas Pitre 
Cc: Russell King 
---
 lib/div64.c | 153 ++--
 1 file changed, 128 insertions(+), 25 deletions(-)

diff --git a/lib/div64.c b/lib/div64.c
index 62a698a..3055328 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -23,37 +23,140 @@
 /* Not needed on 64bit architectures */
 #if BITS_PER_LONG == 32
 
+/* our own fls implementation to make sure constant propagation is fine */
+inline int __div64_fls(int bits)
+{
+   unsigned int __left = bits, __nr = 0;
+
+   if (__left & 0xffff0000)
+   __nr += 16, __left >>= 16;
+
+   if (__left & 0xff00)
+   __nr +=  8, __left >>=  8;
+
+   if (__left & 0x00f0)
+   __nr +=  4, __left >>=  4;
+
+   if (__left & 0x000c)
+   __nr +=  2, __left >>=  2;
+
+   if (__left & 0x0002)
+   __nr +=  1;
+
+   return __nr;
+}
+
+/*
+ * If the divisor happens to be constant, we determine the appropriate
+ * inverse at compile time to turn the division into a few inline
+ * multiplications instead which is much faster.
+ */
 uint32_t __attribute__((weak)) __div64_32(uint64_t *n, uint32_t base)
 {
-   uint64_t rem = *n;
-   uint64_t b = base;
-   uint64_t res, d = 1;
-   uint32_t high = rem >> 32;
-
-   /* Reduce the thing a bit first */
-   res = 0;
-   if (high >= base) {
-   high /= base;
-   res = (uint64_t) high << 32;
-   rem -= (uint64_t) (high*base) << 32;
-   }
+   unsigned int __r, __b = base;
 
-   while ((int64_t)b > 0 && b < rem) {
-   b = b+b;
-   d = d+d;
-   }
+   if (!__builtin_constant_p(__b) || __b == 0) {
+   /* non-constant divisor (or zero): slow path */
+   uint64_t rem = *n;
+   uint64_t b = base;
+   uint64_t res, d = 1;
+   uint32_t high = rem >> 32;
+
+   /* Reduce the thing a bit first */
+   res = 0;
+   if (high >= base) {
+   high /= base;
+   res = (uint64_t) high << 32;
+   rem -= (uint64_t) (high*base) << 32;
+   }
+
+   while ((int64_t)b > 0 && b < rem) {
+   b = b+b;
+   d = d+d;
+   }
+
+   do {
+   if (rem >= b) {
+   rem -= b;
+   res += d;
+   }
+   b >>= 1;
+   d >>= 1;
+   } while (d);
 
-   do {
-   if (rem >= b)

Re: [PATCH 1/5] mtd: ofpart: grab device tree node directly from master device node

2015-10-28 Thread Marek Vasut
On Wednesday, October 28, 2015 at 09:55:24 PM, Robert Jarzmik wrote:
> Brian Norris  writes:
> >> > Do some sorts of chipselects come into play here ? Ie. you can have
> >> > one master with multiple NAND chips connected to it.
> >> 
> >> Most NAND controllers support interacting with several chips (or
> >> dies in case your chip embeds several NAND dies), but I keep thinking
> >> each physical chip should have its own instance of nand_chip + mtd_info.
> >> If you want to have a single mtd device aggregating several chips you
> >> can use mtdconcat.
> >> 
> >> This leaves the multi-dies chip case, and IHMO we should represent those
> >> chips as a single entity, and I guess that's the purpose of the
> >> ->numchips field in nand_chip (if your chip embeds 2 dies with 2 CS
> >> lines, then ->numchips should be 2).
> > 
> > Yes, I think that's some of the intention there. And so even in that
> > case, a multi-die chip gets represented as a single struct nand_chip.
> 
> Isn't there the case of a single NAND controller with 2 identical chips,
> each a 8 bit NAND chip, and the controller aggregating them to offer the
> OS a single 16-bit NAND chip ?

Is that using 1 or 2 physical chipselect lines on the CPU (controller) ?

> In this case, the controller (pxa3xx is a good example) will be programmed
> to handle both chips at the same time, and calculate CRC on both chips,
> etc ... I hope the assertion "physical chip should have its own instance
> of nand_chip + mtd_info" does take into account this example.
> 
> I don't know if there is actually any user of this for either pxa3xx or
> another controller, nor if there is any value in this.
> 
> Cheers.

Best regards,
Marek Vasut


[PATCH] uapi: mqueue.h: add missing linux/types.h include

2015-10-28 Thread Mike Frysinger
From: Mike Frysinger 

Commit 63159f5dcccb3858d88aaef800c4ee0eb4cc8577 changed the types from
long to __kernel_long_t, but didn't add a linux/types.h include.  Code
that tries to include this header directly breaks:

/usr/include/linux/mqueue.h:26:2: error: unknown type name '__kernel_long_t'
  __kernel_long_t mq_flags; /* message queue flags   */

This also upsets configure tests for this header:
checking linux/mqueue.h usability... no
checking linux/mqueue.h presence... yes
configure: WARNING: linux/mqueue.h: present but cannot be compiled
configure: WARNING: linux/mqueue.h: check for missing prerequisite headers?
configure: WARNING: linux/mqueue.h: see the Autoconf documentation
configure: WARNING: linux/mqueue.h: section "Present But Cannot Be Compiled"
configure: WARNING: linux/mqueue.h: proceeding with the compiler's result
checking for linux/mqueue.h... no

Signed-off-by: Mike Frysinger 
---
 include/uapi/linux/mqueue.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/mqueue.h b/include/uapi/linux/mqueue.h
index d0a2b8e..bbd5116 100644
--- a/include/uapi/linux/mqueue.h
+++ b/include/uapi/linux/mqueue.h
@@ -18,6 +18,8 @@
 #ifndef _LINUX_MQUEUE_H
 #define _LINUX_MQUEUE_H
 
+#include 
+
#define MQ_PRIO_MAX 32768
 /* per-uid limit of kernel memory used by mqueue, in bytes */
 #define MQ_BYTES_MAX   819200
-- 
2.5.2



RE: [Intel-wired-lan] [PATCH] fm10k:Fix error handling in the function fm10k_resume

2015-10-28 Thread Singh, Krishneil K


-Original Message-
From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On 
Behalf Of Nicholas Krause
Sent: Saturday, October 17, 2015 9:21 AM
To: Kirsher, Jeffrey T 
Cc: linux-kernel@vger.kernel.org; intel-wired-...@lists.osuosl.org; 
net...@vger.kernel.org
Subject: [Intel-wired-lan] [PATCH] fm10k:Fix error handling in the function 
fm10k_resume

This fixes the error handling to properly check whether the call to
fm10k_mbx_request_irq() has failed and, if so, return the error code
immediately to the caller of fm10k_resume() to signal that a failure
occurred while resuming this network device.

Signed-off-by: Nicholas Krause 
---

Tested-by: Krishneil SIngh 



Re: [lustre-devel] [PATCH 09/10] staging: lustre: fix remaining checkpatch issues for libcfs_hash.h

2015-10-28 Thread Dilger, Andreas
On 2015/10/28, 10:54, "lustre-devel on behalf of James Simmons"
 wrote:

>From: James Simmons 
>
>Final cleanup to make libcfs_hash.h completely kernel standard
>compliant.
>
>Signed-off-by: James Simmons 
>---
> .../lustre/include/linux/libcfs/libcfs_hash.h  |   16
>++--
> 1 files changed, 10 insertions(+), 6 deletions(-)
>
>diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
>b/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
>index 5df8ba2..563b2b4 100644
>--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
>+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_hash.h
>@@ -62,7 +62,8 @@
> /** disable debug */
> #define CFS_HASH_DEBUG_NONE   0
> /** record hash depth and output to console when it's too deep,
>- *  computing overhead is low but consume more memory */
>+ *  computing overhead is low but consume more memory
>+ */

Typically, multi-line comments have the leading /* on a separate line
from the first line of text.  If you are changing all these comments
you may as well make it consistent with the kernel style.

Cheers, Andreas

> #define CFS_HASH_DEBUG_1  1
> /** expensive, check key validation */
> #define CFS_HASH_DEBUG_2  2
>@@ -158,7 +159,8 @@ enum cfs_hash_tag {
>*/
>   CFS_HASH_NBLK_CHANGE= 1 << 13,
>   /** NB, we typed hs_flags as  __u16, please change it
>-   * if you need to extend >=16 flags */
>+   * if you need to extend >=16 flags
>+   */
> };
> 
> /** most used attributes */
>@@ -206,7 +208,8 @@ enum cfs_hash_tag {
> 
> struct cfs_hash {
>   /** serialize with rehash, or serialize all operations if
>-   * the hash-table has CFS_HASH_NO_BKTLOCK */
>+   * the hash-table has CFS_HASH_NO_BKTLOCK
>+   */
>   union cfs_hash_lock hs_lock;
>   /** hash operations */
>   struct cfs_hash_ops *hs_ops;
>@@ -375,7 +378,8 @@ cfs_hash_with_no_itemref(struct cfs_hash *hs)
> {
>   /* hash-table doesn't keep refcount on item,
>* item can't be removed from hash unless it's
>-   * ZERO refcount */
>+   * ZERO refcount.
>+   */
>   return (hs->hs_flags & CFS_HASH_NO_ITEMREF) != 0;
> }
> 
>@@ -820,7 +824,7 @@ cfs_hash_djb2_hash(const void *key, size_t size,
>unsigned mask)
> {
>   unsigned i, hash = 5381;
> 
>-  LASSERT(key != NULL);
>+  LASSERT(key);
> 
>   for (i = 0; i < size; i++)
>   hash = hash * 33 + ((char *)key)[i];
>@@ -848,7 +852,7 @@ cfs_hash_u64_hash(const __u64 key, unsigned mask)
> 
> /** iterate over all buckets in @bds (array of struct cfs_hash_bd) */
> #define cfs_hash_for_each_bd(bds, n, i)   \
>-  for (i = 0; i < n && (bds)[i].bd_bucket != NULL; i++)
>+  for (i = 0; i < n && (bds)[i].bd_bucket; i++)
> 
> /** iterate over all buckets of @hs */
> #define cfs_hash_for_each_bucket(hs, bd, pos) \
>-- 
>1.7.1
>
>___
>lustre-devel mailing list
>lustre-de...@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
>


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




Re: [lustre-devel] [PATCH 08/10] staging: lustre: remove white space in libcfs_hash.h

2015-10-28 Thread Dilger, Andreas
On 2015/10/28, 10:54, "lustre-devel on behalf of James Simmons"
 wrote:

>From: James Simmons 
>
>Cleanup all the unneeded white space in libcfs_hash.h.
>
>Signed-off-by: James Simmons 

Minor note - it would be better to keep these two email addresses
consistent.

>struct cfs_hash_bd {
>-  struct cfs_hash_bucket  *bd_bucket;  /**< address of bucket */
>-  unsigned intbd_offset;  /**< offset in bucket */
>+  /**< address of bucket */
>+  struct cfs_hash_bucket  *bd_bucket;
>+  /**< offset in bucket */
>+  unsigned int bd_offset;
> };

The "/**< ... */" marker means "the field to the left", but if you are
moving these to the line before the field you should just use "/* ... */".

Cheers, Andreas

> 
>-#define CFS_HASH_NAME_LEN 16  /**< default name length */
>-#define CFS_HASH_BIGNAME_LEN  64  /**< bigname for param tree */
>+#define CFS_HASH_NAME_LEN 16  /**< default name length */
>+#define CFS_HASH_BIGNAME_LEN  64  /**< bigname for param tree */
> 
>-#define CFS_HASH_BKT_BITS 3   /**< default bits of bucket */
>-#define CFS_HASH_BITS_MAX 30  /**< max bits of bucket */
>-#define CFS_HASH_BITS_MIN CFS_HASH_BKT_BITS
>+#define CFS_HASH_BKT_BITS 3   /**< default bits of bucket */
>+#define CFS_HASH_BITS_MAX 30  /**< max bits of bucket */
>+#define CFS_HASH_BITS_MIN CFS_HASH_BKT_BITS
> 
> /**
>  * common hash attributes.
>@@ -133,41 +129,41 @@ enum cfs_hash_tag {
>*/
>   CFS_HASH_NO_LOCK= 1 << 0,
>   /** no bucket lock, use one spinlock to protect the whole hash */
>-  CFS_HASH_NO_BKTLOCK = 1 << 1,
>+  CFS_HASH_NO_BKTLOCK = 1 << 1,
>   /** rwlock to protect bucket */
>-  CFS_HASH_RW_BKTLOCK = 1 << 2,
>+  CFS_HASH_RW_BKTLOCK = 1 << 2,
>   /** spinlock to protect bucket */
>-  CFS_HASH_SPIN_BKTLOCK   = 1 << 3,
>+  CFS_HASH_SPIN_BKTLOCK   = 1 << 3,
>   /** always add new item to tail */
>-  CFS_HASH_ADD_TAIL   = 1 << 4,
>+  CFS_HASH_ADD_TAIL   = 1 << 4,
>   /** hash-table doesn't have refcount on item */
>-  CFS_HASH_NO_ITEMREF = 1 << 5,
>+  CFS_HASH_NO_ITEMREF = 1 << 5,
>   /** big name for param-tree */
>   CFS_HASH_BIGNAME= 1 << 6,
>   /** track global count */
>   CFS_HASH_COUNTER= 1 << 7,
>   /** rehash item by new key */
>-  CFS_HASH_REHASH_KEY = 1 << 8,
>+  CFS_HASH_REHASH_KEY = 1 << 8,
>   /** Enable dynamic hash resizing */
>-  CFS_HASH_REHASH  = 1 << 9,
>+  CFS_HASH_REHASH = 1 << 9,
>   /** can shrink hash-size */
>-  CFS_HASH_SHRINK  = 1 << 10,
>+  CFS_HASH_SHRINK = 1 << 10,
>   /** assert hash is empty on exit */
>-  CFS_HASH_ASSERT_EMPTY   = 1 << 11,
>+  CFS_HASH_ASSERT_EMPTY   = 1 << 11,
>   /** record hlist depth */
>-  CFS_HASH_DEPTH= 1 << 12,
>+  CFS_HASH_DEPTH  = 1 << 12,
>   /**
>* rehash is always scheduled in a different thread, so current
>* change on hash table is non-blocking
>*/
>-  CFS_HASH_NBLK_CHANGE= 1 << 13,
>+  CFS_HASH_NBLK_CHANGE= 1 << 13,
>   /** NB, we typed hs_flags as  __u16, please change it
>* if you need to extend >=16 flags */
> };
> 
> /** most used attributes */
>-#define CFS_HASH_DEFAULT   (CFS_HASH_RW_BKTLOCK | \
>-  CFS_HASH_COUNTER | CFS_HASH_REHASH)
>+#define CFS_HASH_DEFAULT  (CFS_HASH_RW_BKTLOCK | \
>+   CFS_HASH_COUNTER | CFS_HASH_REHASH)
> 
> /**
>  * cfs_hash is a hash-table implementation for general purpose, it can
>support:
>@@ -211,7 +207,7 @@ enum cfs_hash_tag {
> struct cfs_hash {
>   /** serialize with rehash, or serialize all operations if
>* the hash-table has CFS_HASH_NO_BKTLOCK */
>-  union cfs_hash_lock  hs_lock;
>+  union cfs_hash_lock hs_lock;
>   /** hash operations */
>   struct cfs_hash_ops *hs_ops;
>   /** hash lock operations */
>@@ -219,57 +215,57 @@ struct cfs_hash {
>   /** hash list operations */
>   struct cfs_hash_hlist_ops   *hs_hops;
>   /** hash buckets-table */
>-  struct cfs_hash_bucket   **hs_buckets;
>+  struct cfs_hash_bucket  **hs_buckets;
>   /** total number of items on this hash-table */
>-  atomic_ths_count;
>+  atomic_ths_count;
>   /** hash flags, see cfs_hash_tag for detail */
>-  __u16  hs_flags;
>+  __u16   hs_flags;
>   /** # of extra-bytes for bucket, for user saving extended attributes */
>-  __u16  hs_extra_bytes;
>+  __u16   hs_extra_bytes;
>   /** wants to iterate */
>-  __u8hs_iterating;
>+  __u8hs_iterating;
>   /*

Re: [patch 3/3] vmstat: Create our own workqueue

2015-10-28 Thread Christoph Lameter
On Wed, 28 Oct 2015, Tetsuo Handa wrote:

> Christoph Lameter wrote:
> > On Wed, 28 Oct 2015, Tejun Heo wrote:
> >
> > > The only thing necessary here is WQ_MEM_RECLAIM.  I don't see how
> > > WQ_SYSFS and WQ_FREEZABLE make sense here.
> >
> I can still trigger silent livelock with this patchset applied.

Ok so why is the vmstat updater still deferred, Tejun?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness

2015-10-28 Thread Jeff Moyer
Ross Zwisler  writes:

> This series implements the very slow but correct handling for
> blkdev_issue_flush() with DAX mappings, as discussed here:
>
> https://lkml.org/lkml/2015/10/26/116
>
> I don't think that we can actually do the
>
> on_each_cpu(sync_cache, ...);
>
> ...where sync_cache is something like:
>
> cache_disable();
> wbinvd();
> pcommit();
> cache_enable();
>
> solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
> your writes actually make it durably onto the DIMMs.  I believe you really do
> need to loop through the cache lines, flush them with CLWB, then fence and
> PCOMMIT.

*blink*
*blink*

So much for not violating the principle of least surprise.  I suppose
you've asked the hardware folks, and they've sent you down this path?

> I do worry that the cost of blindly flushing the entire PMEM namespace on each
> fsync or msync will be prohibitively expensive, and that we'll be very
> incentivized to move to the radix tree based dirty page tracking as soon as
> possible. :)

Sure, but wbinvd would be quite costly as well.  Either way I think a
better solution will be required in the near term.

Cheers,
Jeff


Re: RFC: 32-bit __data_len and REQ_DISCARD+REQ_SECURE

2015-10-28 Thread Jeff Moyer
Ulf Hansson  writes:

> I am not sure if this issue is the same as been discussed earlier on
> the mmc list regarding "discard/erase".
>
> Anyway, there have been several attempts to fix bugs related to this.
> One of these discussion kind of pointed out a viable solution, but
> unfortunate no patches that adopts that solution have been posted yet.
>
> You might want to read up on this.
> https://www.mail-archive.com/linux-mmc@vger.kernel.org/msg23643.html
> http://linux-mmc.vger.kernel.narkive.com/Wp31G953/patch-mmc-core-don-t-return-1-for-max-discard
>
> So this is an old issue, which should have been fixed long long long time 
> ago...

Thanks Ulf.  After reading all of the linked discussions, it's my
understanding that this is an emmc-specific issue that doesn't require
any block layer changes.  If that's wrong, please let me know.

Cheers,
Jeff


Re: [PATCH] audit: removing unused variable

2015-10-28 Thread Joe Perches
On Wed, 2015-10-28 at 16:35 -0400, Paul Moore wrote:
> On Wednesday, October 28, 2015 09:40:34 AM Saurabh Sengar wrote:
> > variavle rc in not required as it is just used for unchanged for return,
> > and return is always 0 in the function.
[]
> Thanks, applied with some spelling corrections to the description.

As the return value is never actually tested,
it seems better to make it a void function,

> > diff --git a/kernel/audit.c b/kernel/audit.c
[]
> > @@ -686,23 +686,22 @@ static int audit_netlink_ok(struct sk_buff *skb, u16
> > msg_type)
> > 
> >  static int audit_log_common_recv_msg(struct audit_buffer **ab, u16
> > msg_type) {
> > -   int rc = 0;
> > uid_t uid = from_kuid(&init_user_ns, current_uid());
> > pid_t pid = task_tgid_nr(current);
> > 
> > if (!audit_enabled && msg_type != AUDIT_USER_AVC) {
> > *ab = NULL;
> > -   return rc;
> > +   return 0;
> > }
> > 
> > *ab = audit_log_start(NULL, GFP_KERNEL, msg_type);
> > if (unlikely(!*ab))
> > -   return rc;
> > +   return 0;
> > audit_log_format(*ab, "pid=%d uid=%u", pid, uid);
> > audit_log_session_info(*ab);
> > audit_log_task_context(*ab);
> > 
> > -   return rc;
> > +   return 0;
> >  }
> > 
> >  int is_audit_feature_set(int i)
> 





Re: [RESEND PATCH] scsi_sysfs: Fix queue_ramp_up_period return code

2015-10-28 Thread Matthew R. Ochs
> On Oct 27, 2015, at 4:49 AM, Peter Oberparleiter  
> wrote:
> 
> Writing a number to /sys/bus/scsi/devices//queue_ramp_up_period
> returns the value of that number instead of the number of bytes written.
> This behavior can confuse programs expecting POSIX write() semantics.
> Fix this by returning the number of bytes written instead.
> 
> Signed-off-by: Peter Oberparleiter 
> Reviewed-by: Hannes Reinecke 
> Cc: sta...@vger.kernel.org
> ---
> drivers/scsi/scsi_sysfs.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index b89..6b0f292 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -898,7 +898,7 @@ sdev_store_queue_ramp_up_period(struct device *dev,
>   return -EINVAL;
> 
>   sdev->queue_ramp_up_period = msecs_to_jiffies(period);
> - return period;
> + return count;
> }
> 
> static DEVICE_ATTR(queue_ramp_up_period, S_IRUGO | S_IWUSR,

Reviewed-by: Matthew R. Ochs 



[PATCH 1/2] pmem: add wb_cache_pmem() to the PMEM API

2015-10-28 Thread Ross Zwisler
The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

Signed-off-by: Ross Zwisler 
---
 arch/x86/include/asm/pmem.h | 11 ++-
 include/linux/pmem.h| 22 +-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..6c7ade0 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr: virtual start address
  * @size:  number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
unsigned long clflush_mask = x86_clflush_size - 1;
+   void *vaddr = (void __force *)addr;
void *vend = vaddr + size;
void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem 
*addr, size_t bytes,
len = copy_from_iter_nocache(vaddr, bytes, i);
 
if (__iter_needs_pmem_wb(i))
-   __arch_wb_cache_pmem(vaddr, bytes);
+   arch_wb_cache_pmem(addr, bytes);
 
return len;
 }
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, 
size_t size)
else
memset(vaddr, 0, size);
 
-   __arch_wb_cache_pmem(vaddr, size);
+   arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..2cd5003 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, 
size_t size)
 {
BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+   BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t 
size)
 {
@@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t 
size)
else
default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:  virtual start address
+ * @size:  number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+   if (arch_has_pmem_api())
+   arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.1.0



[PATCH 2/2] pmem: Add simple and slow fsync/msync support

2015-10-28 Thread Ross Zwisler
Make blkdev_issue_flush() behave correctly according to its required
semantics - all volatile cached data is flushed to stable storage.

Eventually this needs to be replaced with something much more precise by
tracking dirty DAX entries via the radix tree in struct address_space, but
for now this gives us correctness even if the performance is quite bad.

Userspace applications looking to avoid the fsync/msync penalty should
consider more fine-grained flushing via the NVML library:

https://github.com/pmem/nvml

Signed-off-by: Ross Zwisler 
---
 drivers/nvdimm/pmem.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..eea7997 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,7 +80,14 @@ static void pmem_make_request(struct request_queue *q, 
struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
 
-   if (bio_data_dir(bio))
+   if (bio->bi_rw & REQ_FLUSH) {
+   void __pmem *addr = pmem->virt_addr + pmem->data_offset;
+   size_t size = pmem->size - pmem->data_offset;
+
+   wb_cache_pmem(addr, size);
+   }
+
+   if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH))
wmb_pmem();
 
bio_endio(bio);
@@ -189,6 +196,7 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+   blk_queue_flush(pmem->pmem_queue, REQ_FLUSH);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
disk = alloc_disk(0);
-- 
2.1.0



[PATCH 0/2] "big hammer" for DAX msync/fsync correctness

2015-10-28 Thread Ross Zwisler
This series implements the very slow but correct handling for
blkdev_issue_flush() with DAX mappings, as discussed here:

https://lkml.org/lkml/2015/10/26/116

I don't think that we can actually do the

on_each_cpu(sync_cache, ...);

...where sync_cache is something like:

cache_disable();
wbinvd();
pcommit();
cache_enable();

solution as proposed by Dan because WBINVD + PCOMMIT doesn't guarantee that
your writes actually make it durably onto the DIMMs.  I believe you really do
need to loop through the cache lines, flush them with CLWB, then fence and
PCOMMIT.
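As a rough user-space illustration of the per-cache-line loop described above, here is a sketch of the alignment arithmetic such a flush would use. All names and the 64-byte line size are assumptions for illustration; a real implementation would issue the actual CLWB instruction per line and follow the loop with SFENCE and PCOMMIT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64UL  /* assumed x86 cache line size */

static unsigned long flushed;  /* counts simulated CLWB operations */

/* Stand-in for the CLWB instruction; a real build would use inline
 * asm or a compiler intrinsic instead of counting. */
static void clwb(const void *line)
{
	(void)line;
	flushed++;
}

/* Flush every cache line covering [addr, addr + size): round the
 * start down to a line boundary, then step one line at a time.
 * A durable flush would follow this with SFENCE and then PCOMMIT. */
static void wb_cache_range(const void *addr, size_t size)
{
	uintptr_t p = (uintptr_t)addr & ~(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + size;

	for (; p < end; p += CACHELINE)
		clwb((const void *)p);
}
```

The round-down matters: a 64-byte range that straddles a line boundary touches two cache lines, and both must be written back.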

I do worry that the cost of blindly flushing the entire PMEM namespace on each
fsync or msync will be prohibitively expensive, and that we'll be very
incentivized to move to the radix tree based dirty page tracking as soon as
possible. :)

Ross Zwisler (2):
  pmem: add wb_cache_pmem() to the PMEM API
  pmem: Add simple and slow fsync/msync support

 arch/x86/include/asm/pmem.h | 11 ++-
 drivers/nvdimm/pmem.c   | 10 +-
 include/linux/pmem.h| 22 +-
 3 files changed, 36 insertions(+), 7 deletions(-)

-- 
2.1.0



[PATCH 1/1] cpufreq: interactive: New 'interactive' governor

2015-10-28 Thread Bálint Czobor
From: Mike Chan 

This governor is designed for latency-sensitive workloads, such as
interactive user interfaces.  The interactive governor aims to be
significantly more responsive, ramping the CPU up quickly when
CPU-intensive activity begins.

Existing governors sample CPU load at a particular rate, typically
every X ms.  This can lead to under-powering UI threads for the period of
time during which the user begins interacting with a previously-idle system
until the next sample period happens.

The 'interactive' governor uses a different approach. Instead of sampling
the CPU at a specified rate, the governor will check whether to scale the
CPU frequency up soon after coming out of idle.  When the CPU comes out of
idle, a timer is configured to fire within 1-2 ticks.  If the CPU is very
busy from exiting idle to when the timer fires then we assume the CPU is
underpowered and ramp to MAX speed.

If the CPU was not sufficiently busy to immediately ramp to MAX speed, then
the governor evaluates the CPU load since the last speed adjustment,
choosing the highest value between that longer-term load or the short-term
load since idle exit to determine the CPU speed to ramp to.

A realtime thread is used for scaling up, giving the remaining tasks the
CPU performance benefit, unlike existing governors which are more likely to
schedule rampup work to occur after your performance starved tasks have
completed.

The tuneables for this governor are:
/sys/devices/system/cpu/cpufreq/interactive/min_sample_time:
The minimum amount of time to spend at the current frequency before
ramping down. This is to ensure that the governor has seen enough
historic CPU load data to determine the appropriate workload.
Default is 8 uS.
/sys/devices/system/cpu/cpufreq/interactive/go_maxspeed_load
The CPU load at which to ramp to max speed.  Default is 85.
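The ramp-up policy described above can be sketched as a pure function. This is only an illustration of the decision logic with assumed names; the real governor operates on jiffies inside timer callbacks rather than through a function like this.

```c
#include <assert.h>

/* Pick a target frequency from two load samples, as described above:
 * if the CPU was nearly saturated straight out of idle, jump to max;
 * otherwise scale with the higher of the short- and long-term loads. */
static unsigned int choose_target_freq(unsigned int max_freq,
				       unsigned int load_since_idle,   /* percent */
				       unsigned int load_since_adjust, /* percent */
				       unsigned int go_maxspeed_load)  /* percent */
{
	unsigned int load = load_since_idle > load_since_adjust ?
			    load_since_idle : load_since_adjust;

	if (load_since_idle >= go_maxspeed_load)
		return max_freq;	/* busy out of idle: ramp to MAX */

	return max_freq * load / 100;	/* otherwise proportional to load */
}
```

With the default go_maxspeed_load of 85, a burst of 90% load right after idle exit jumps straight to max_freq, while steadier moderate load scales proportionally.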

Signed-off-by: Mike Chan 
Signed-off-by: Todd Poynor 
Signed-off-by: Bálint Czobor 
---
 Documentation/cpu-freq/governors.txt   |   37 +
 drivers/cpufreq/Kconfig|   27 +
 drivers/cpufreq/Makefile   |1 +
 drivers/cpufreq/cpufreq_interactive.c  | 1338 
 include/linux/cpufreq.h|3 +
 include/trace/events/cpufreq_interactive.h |  112 +++
 6 files changed, 1518 insertions(+)
 create mode 100644 drivers/cpufreq/cpufreq_interactive.c
 create mode 100644 include/trace/events/cpufreq_interactive.h

diff --git a/Documentation/cpu-freq/governors.txt 
b/Documentation/cpu-freq/governors.txt
index c15aa75..b262c53 100644
--- a/Documentation/cpu-freq/governors.txt
+++ b/Documentation/cpu-freq/governors.txt
@@ -28,6 +28,7 @@ Contents:
 2.3  Userspace
 2.4  Ondemand
 2.5  Conservative
+2.6  Interactive
 
 3.   The Governor Interface in the CPUfreq Core
 
@@ -218,6 +219,42 @@ a decision on when to decrease the frequency while running 
in any
 speed. Load for frequency increase is still evaluated every
 sampling rate.
 
+2.6 Interactive
+---
+
+The CPUfreq governor "interactive" is designed for latency-sensitive,
+interactive workloads. This governor sets the CPU speed depending on
+usage, similar to "ondemand" and "conservative" governors.  However,
+the governor is more aggressive about scaling the CPU speed up in
+response to CPU-intensive activity.
+
+Sampling the CPU load every X ms can lead to under-powering the CPU
+for X ms, leading to dropped frames, stuttering UI, etc.  Instead of
+sampling the cpu at a specified rate, the interactive governor will
+check whether to scale the cpu frequency up soon after coming out of
+idle.  When the cpu comes out of idle, a timer is configured to fire
+within 1-2 ticks.  If the cpu is very busy between exiting idle and
+when the timer fires then we assume the cpu is underpowered and ramp
+to MAX speed.
+
+If the cpu was not sufficiently busy to immediately ramp to MAX speed,
+then governor evaluates the cpu load since the last speed adjustment,
+choosing the highest value between that longer-term load or the
+short-term load since idle exit to determine the cpu speed to ramp to.
+
+The tuneable values for this governor are:
+
+min_sample_time: The minimum amount of time to spend at the current
+frequency before ramping down. This is to ensure that the governor has
+seen enough historic cpu load data to determine the appropriate
+workload.  Default is 8 uS.
+
+go_maxspeed_load: The CPU load at which to ramp to max speed.  Default
+is 85.
+
+timer_rate: Sample rate for reevaluating cpu load when the system is
+not idle.  Default is 3 uS.
+
 3. The Governor Interface in the CPUfreq Core
 =
 
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 659879a..6e099e5 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,16 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
  Be aware that not all cpufreq drivers support the cons

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Yunhong Jiang
On Wed, Oct 28, 2015 at 12:18:48PM -0600, Alex Williamson wrote:
> On Wed, 2015-10-28 at 10:50 -0700, Yunhong Jiang wrote:
> > On Wed, Oct 28, 2015 at 01:44:55AM +0100, Paolo Bonzini wrote:
> 
> It's in linux-next via the kvm.git next branch:
> 
> git://git.kernel.org/pub/scm/virt/kvm/kvm.git
> 
> Thanks,
> Alex

Thanks

--jyh

> 


Re: [PATCH 1/5] iov: Update virtfn_max_buses to validate offset and stride

2015-10-28 Thread Alexander Duyck

On 10/28/2015 11:43 AM, Bjorn Helgaas wrote:

On Wed, Oct 28, 2015 at 11:32:16AM -0500, Bjorn Helgaas wrote:

Hi Alex,

Thanks a lot for cleaning this up.  I think this is a great
improvement over what I did.

On Tue, Oct 27, 2015 at 01:52:15PM -0700, Alexander Duyck wrote:

This patch pulls the validation of offset and stride into virtfn_max_buses.
The general idea is to validate offset and stride for each possible value
of numvfs in addition to still determining the maximum bus value for the
VFs.

I also reversed the loop as the most likely maximum will be when numvfs is
set to total_VFs.  In addition this makes it so that we loop down to a
value of 0 for numvfs which should be the resting state for the register.

Fixes: 8e20e89658f2 ("PCI: Set SR-IOV NumVFs to zero after enumeration")
Signed-off-by: Alexander Duyck 


I'd like to squash this together with my patch instead of having fixes
on top of fixes.  What do you think of the following?  (This applies
on top of 70675e0b6a1a ("PCI: Don't try to restore VF BARs")).


commit c20e11b572c5d4e4f01c86580a133122fbd13cfa
Author: Alexander Duyck 
Date:   Wed Oct 28 10:54:32 2015 -0500

 PCI: Set SR-IOV NumVFs to zero after enumeration

 The enumeration path should leave NumVFs set to zero.  But after
 4449f079722c ("PCI: Calculate maximum number of buses required for VFs"),
 we call virtfn_max_buses() in the enumeration path, which changes NumVFs.
 This NumVFs change is visible via lspci and sysfs until a driver enables
 SR-IOV.

 Iterate from TotalVFs down to zero so NumVFs is zero when we're finished
 computing the maximum number of buses.  Validate offset and stride in
 the loop, so we can test it at every possible NumVFs setting.  Rename
 virtfn_max_buses() to compute_max_vf_buses() to hint that it does have a
 side effect of updating iov->max_VF_buses.

 [bhelgaas: changelog, rename, reverse sense of error path]
 Fixes: 4449f079722c ("PCI: Calculate maximum number of buses required for 
VFs")
 Based-on-patch-by: Ethan Zhao 
 Signed-off-by: Alexander Duyck 
 Signed-off-by: Bjorn Helgaas 

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..120cfb3 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -54,24 +54,33 @@ static inline void pci_iov_set_numvfs(struct pci_dev *dev, 
int nr_virtfn)
   * The PF consumes one bus number.  NumVFs, First VF Offset, and VF Stride
   * determine how many additional bus numbers will be consumed by VFs.
   *
- * Iterate over all valid NumVFs and calculate the maximum number of bus
- * numbers that could ever be required.
+ * Iterate over all valid NumVFs, validate offset and stride, and calculate
+ * the maximum number of bus numbers that could ever be required.
   */
-static inline u8 virtfn_max_buses(struct pci_dev *dev)
+static int compute_max_vf_buses(struct pci_dev *dev)
  {
struct pci_sriov *iov = dev->sriov;
-   int nr_virtfn;
-   u8 max = 0;
+   int nr_virtfn = iov->total_VFs;
int busnr;

-   for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
-   pci_iov_set_numvfs(dev, nr_virtfn);
+   pci_iov_set_numvfs(dev, nr_virtfn);
+
+   while (nr_virtfn--) {
+   if (!iov->offset || !iov->stride)
+   goto err;


I think we have a minor problem here.  In sriov_enable(), we return an
error if "nr_virtfn > 1 && !iov->stride", so it's legal for stride to
be zero if NumVF is 1.  Here we don't allow that.  Sec 3.3.10 says:

   Note: VF Stride is unused if NumVFs is 0 or 1.  If NumVFs is greater
   than 1, VF Stride must not be zero."

So I think we should allow "stride == 0" here when NumVFs is 1.


Right, we shouldn't be testing it if NumVFs is 1 or less.
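The agreed-upon check could be expressed as a small predicate, modeled on the test quoted above from sriov_enable() (the function and parameter names here are illustrative, not the kernel's):

```c
#include <assert.h>

/* Offset must be non-zero whenever any VFs are enabled; per SR-IOV
 * spec sec 3.3.10, stride only matters once NumVFs exceeds 1. */
static int vf_params_valid(int nr_virtfn, unsigned int offset,
			   unsigned int stride)
{
	if (!offset)
		return 0;
	if (nr_virtfn > 1 && !stride)
		return 0;
	return 1;
}
```

So a device reporting stride 0 is acceptable at NumVFs == 1 but rejected at any higher count.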


+
busnr = pci_iov_virtfn_bus(dev, nr_virtfn - 1);


I think this loop management is slightly wrong: I don't think we ever
compute busnr for the highest VF because we always decrement nr_virtfn
after calling pci_iov_set_numvfs(), and then we subtract one again.
E.g., if Total VFs is 8, the VFs are numbered VF0..VF7, and we have
this, which doesn't check VF7:

   nr_virtfn = iov->total_VFs # nr_virtfn == 8
   pci_iov_set_numvfs(..., nr_virtfn) # passes 8 (correct)
   while (nr_virtfn--) {
  # nr_virtfn == 7 in loop body
 pci_iov_virtfn_bus(..., nr_virtfn - 1)   # passes 6 (wrong)



Yeah, that was supposed to just be nr_virtfn.
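With that fix, the `while (nr_virtfn--)` loop's body sees every VF index from TotalVFs-1 down to 0, so the highest VF is no longer skipped. A standalone sketch of just the iteration (hypothetical helper name, recording indices instead of calling pci_iov_virtfn_bus()):

```c
#include <assert.h>

/* Record the VF indices the corrected loop would pass to
 * pci_iov_virtfn_bus(): highest index first, zero last. */
static int visit_vf_indices(int total_VFs, int *out)
{
	int nr_virtfn = total_VFs;
	int n = 0;

	while (nr_virtfn--)
		out[n++] = nr_virtfn;	/* body sees total_VFs-1 .. 0 */

	return n;
}
```

For TotalVFs == 8 this visits 7, 6, ..., 0 — including VF7, which the `nr_virtfn - 1` version missed.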


-   if (busnr > max)
-   max = busnr;
+   if (busnr > iov->max_VF_buses)
+   iov->max_VF_buses = busnr;
+
+   pci_iov_set_numvfs(dev, nr_virtfn);
}

-   return max;
+   return 0;
+
+err:
+   pci_iov_set_numvfs(dev, 0);
+   return -EIO;
  }


Here's my new proposal:

   static int compute_max_vf_buses(struct pci_dev *dev)
   {
   struct pci_sriov *iov = dev->sriov;
   int nr_vi

Re: [PATCH] vfio: Include No-IOMMU mode

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 03:21:45PM -0600, Alex Williamson wrote:
> There is really no way to safely give a user full access to a DMA
> capable device without an IOMMU to protect the host system.  There is
> also no way to provide DMA translation, for use cases such as device
> assignment to virtual machines.  However, there are still those users
> that want userspace drivers even under those conditions.  The UIO
> driver exists for this use case, but does not provide the degree of
> device access and programming that VFIO has.  In an effort to avoid
> code duplication, this introduces a No-IOMMU mode for VFIO.
> 
> This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
> the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
> should make it very clear that this mode is not safe.  Additionally,
> CAP_SYS_RAWIO privileges are necessary to work with groups and
> containers using this mode.  Groups making use of this support are
> named /dev/vfio/noiommu-$GROUP and can only make use of the special
> VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
> binding a device without a native IOMMU group to a VFIO bus driver
> will taint the kernel and should therefore not be considered
> supported.  This patch includes no-iommu support for the vfio-pci bus
> driver only.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
> This is pretty well the same as RFCv2, I've changed the pr_warn to a
> dev_warn and added another, printing the pid and comm of the task when
> it actually opens the device.  If Stephen can port the driver code
> over and prove that this actually works sometime next week, and there
> aren't any objections to this code, I'll include it in a pull request
> for the next merge window.  MST, I dropped your ack due to the
> changes, but I'll be happy to add it back if you like.  Thanks,
> 
> Alex

Yea. This actually can be used safely with devices that don't do DMA.

And given that people seem determined to poke at devices from userspace
even when there's no IOMMU, we are probably better off with supporting
the use-case in vfio - at least this way code will be easier to port
over once hypervisors do support IOMMUs.

Acked-by: Michael S. Tsirkin 


>  drivers/vfio/Kconfig|   15 +++
>  drivers/vfio/pci/vfio_pci.c |8 +-
>  drivers/vfio/vfio.c |  186 
> ++-
>  include/linux/vfio.h|3 +
>  include/uapi/linux/vfio.h   |7 ++
>  5 files changed, 209 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 4540179..b6d3cdc 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -31,5 +31,20 @@ menuconfig VFIO
>  
> If you don't know what to do here, say N.
>  
> +menuconfig VFIO_NOIOMMU
> + bool "VFIO No-IOMMU support"
> + depends on VFIO
> + help
> +   VFIO is built on the ability to isolate devices using the IOMMU.
> +   Only with an IOMMU can userspace access to DMA capable devices be
> +   considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
> +   devices without IOMMU backing for the purpose of re-using the VFIO
> +   infrastructure in a non-secure mode.  Use of this mode will result
> +   in an unsupportable kernel and will therefore taint the kernel.
> +   Device assignment to virtual machines is also not possible with
> +   this mode since there is no IOMMU to provide DMA translation.
> +
> +   If you don't know what to do here, say N.
> +
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 964ad57..32b88bd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>   if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
>   return -EINVAL;
>  
> - group = iommu_group_get(&pdev->dev);
> + group = vfio_iommu_group_get(&pdev->dev);
>   if (!group)
>   return -EINVAL;
>  
>   vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>   if (!vdev) {
> - iommu_group_put(group);
> + vfio_iommu_group_put(group, &pdev->dev);
>   return -ENOMEM;
>   }
>  
> @@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
>   if (ret) {
> - iommu_group_put(group);
> + vfio_iommu_group_put(group, &pdev->dev);
>   kfree(vdev);
>   return ret;
>   }
> @@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
>   if (!vdev)
>   return;
>  
> - iommu_group_put(pdev->dev.iommu_group);
> + vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
>   kfree(vdev);
>  
>   if (vfio_pci_

Re: [PATCH 2/5] iov: Reset resources to 0 if totalVFs increases after enabling ARI

2015-10-28 Thread Alexander Duyck

On 10/28/2015 12:52 PM, Bjorn Helgaas wrote:

On Wed, Oct 28, 2015 at 11:32:14AM -0700, Alexander Duyck wrote:

On 10/28/2015 09:37 AM, Bjorn Helgaas wrote:

Hi Alex,

On Tue, Oct 27, 2015 at 01:52:21PM -0700, Alexander Duyck wrote:

This patch forces us to reallocate VF BARs if the totalVFs value has
increased after enabling ARI.  This normally shouldn't occur, however I
have seen some non-spec devices that shift between 7 and some value greater
than 7 based on the ARI value and we want to avoid triggering any issues
with such devices.


Can you include specifics about the devices?  The value "7" is pretty
specific, so if we're going to include that level of detail, we should
have the actual device info to go with it.


I referenced 7 as that is the largest number of VFs a single
function can support assuming a single function without ARI and
without the ability to handle Type 1 configuration requests.  The
Intel fm10k driver has logic in it that does a check for ARI and if
it is supported it reports via sysfs a totalVFs of 64, otherwise it
limits the totalVFs reported to 7.  However, I don't believe it
exposes the limitation via the configuration space.


Ah, OK, that makes sense.


I guess the problem is:

   - Device supports 7 TotalVFs with ARI disabled, >7 with ARI enabled
   - Firmware leaves ARI disabled in SRIOV_CTRL
   - Firmware computes size based on 7 VFs
   - Firmware allocates space and programs BARs for 7 VFs
   - Linux enables ARI, reads >7 TotalVFs
   - Linux computes size based on >7 VFs
   - Increased size may overlap other resources

Right?


Right.  More than likely what will happen is that you will see
overlap of the device on itself if it has multiple base address
registers assigned to the VFs.


Fixes: 3aa71da412fe ("PCI: Enable SR-IOV ARI Capable Hierarchy before reading 
TotalVFs")
Signed-off-by: Alexander Duyck 
---
  drivers/pci/iov.c |   11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 099050d78a39..238950412de0 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -393,7 +393,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
int rc;
int nres;
u32 pgsz;
-   u16 ctrl, total;
+   u16 ctrl, total, orig_total;
struct pci_sriov *iov;
struct resource *res;
struct pci_dev *pdev;
@@ -402,6 +402,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT)
return -ENODEV;

+   pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &orig_total);
pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
if (ctrl & PCI_SRIOV_CTRL_VFE) {
pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
@@ -450,6 +451,14 @@ found:
}
iov->barsz[i] = resource_size(res);
res->end = res->start + resource_size(res) * total - 1;
+
+   /* force reallocation of BARs if total VFs increased */
+   if (orig_total < total) {
+   res->flags |= IORESOURCE_UNSET;
+   res->end -= res->start;
+   res->start = 0;
+   }


Two thoughts here:

1) Even if the required space increased, it's possible that firmware
placed the BAR somewhere where the extra space is available.  In that
case, this forces reallocation unnecessarily.


I'd say it is possible, but not likely.  From past experience I have
seen BIOSes do some very dumb things when it comes to SR-IOV,
assuming they even support it.

In addition many of the VF devices out there support more than one
base address register per function.  The Intel NICs for example have
one for device registers and one for MSI-X registers.  And most
BIOSes usually pack one right after the other from what I have seen.
So while there may be more space there what usually happens is that
the MSI-X region will have to be relocated in order to make room for
expanding the other base address register.

My last bit on all this is that VFs are meant to be assigned into
guests.  I would argue that for the sake of security we are much
better off invalidating the VF base address registers and forcing a
reallocation if there is even a risk of the VF base address register
space overlapping with some other piece of host memory.  We don't
want to risk possibly exposing any bits of the host that we didn't
intend on.


Agreed, not likely for several reasons.


2) This *feels* like something the PCI core should be doing anyway,
even without any help here.  Shouldn't we fail in pci_claim_resource()
and set IORESOURCE_UNSET there?


This is really the core of my question -- what problem does this patch
solve?  I'm trying to figure out if delaying the read of TotalVFs
until after we set ARI Capable Hierarchy is sufficient, and if it's
not sufficient, *why* not?


I suppose you have a point.  As long as the PCI core is taking care of 
a

Re: [PATCH v2] blktrace: re-write setting q->blk_trace

2015-10-28 Thread Jeff Moyer
Davidlohr Bueso  writes:

> This is really about simplifying the double xchg patterns into
> a single cmpxchg, with the same logic. Other than the immediate
> cleanup, there are some subtleties this change deals with:
>
> (i) While the load of the old bt is fully ordered wrt everything,
> ie:
>
>   old_bt = xchg(&q->blk_trace, bt); [barrier]
>   if (old_bt)
>(void) xchg(&q->blk_trace, old_bt);[barrier]
>
> blk_trace could still be changed between the xchg and the old_bt
> load. Note that this description is merely theoretical and afaict
> very small, but doing everything in a single context with cmpxchg
> closes this potential race.
>
> (ii) Ordering guarantees are obviously kept with cmpxchg.

Hi David,

The patch itself looks ok, but it doesn't seem to apply to a recent
kernel tree.  It appears as though it is white-space damaged.  Would you
mind re-sending it?

Thanks!
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [V5, 2/6] fsl/fman: Add FMan support

2015-10-28 Thread Scott Wood
On Tue, 2015-10-27 at 11:32 -0500, Liberman Igal-B31950 wrote:

> > > +
> > > +struct device *fman_get_device(struct fman *fman) {
> > > + return fman->dev;
> > > +}
> > 
> > Is this really necessary?
> > 
> 
> The Fman port needs fman->dev and the fman structure is opaque, so yes, it's 
> needed.

Why is opacity being maintained from one part of the fman driver to another?  
Isn't this the sort of excessive layering that was complained about?


> > > + /* In B4 rev 2.0 (and above) the MURAM size is 512KB.
> > > +  * Check the SVR and update MURAM size if required.
> > > +  */
> > > + u32 svr;
> > > +
> > > + svr = mfspr(SPRN_SVR);
> > > +
> > > + if ((SVR_SOC_VER(svr) == SVR_B4860) && (SVR_MAJ(svr) >=
> > 2))
> > > + fman->dts_params.muram_size = 0x8;
> > > + }
> > 
> > Why wasn't the MURAM size described in the device tree, as it was with
> > CPM/QE?
> > 
> 
> The MURAM size is described by the device tree.
> In B4860 rev 2.0 (and above) the MURAM size is bigger. 
> This is a workaround, in order to have the same device tree for all B4860 
> revisions.

We don't support b4860 prior to rev 2.0 (due to e6500 core errata) so this is 
irrelevant.  Fix the device tree.

> > > +
> > > + of_node_put(muram_node);
> > > + of_node_put(fm_node);
> > > +
> > > + err = devm_request_irq(&of_dev->dev, irq, fman_irq,
> > > +IRQF_NO_SUSPEND, "fman", fman);
> > > + if (err < 0) {
> > > + pr_err("Error: allocating irq %d (error = %d)\n", irq, err);
> > > + goto fman_free;
> > > + }
> > 
> > Why IRQF_NO_SUSPEND?
> > 
> 
> It shouldn't be IRQF_NO_SUSPEND for now, removed. 

Why just "for now"?

-Scott



Re: [PATCH 1/1] cpufreq: interactive: New 'interactive' governor

2015-10-28 Thread kbuild test robot
Hi Mike,

[auto build test ERROR on pm/linux-next -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/B-lint-Czobor/cpufreq-interactive-New-interactive-governor/20151029-041207
config: i386-allmodconfig (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> drivers/cpufreq/cpufreq_interactive.c:35:46: fatal error: 
>> trace/events/cpufreq_interactive.h: No such file or directory
   compilation terminated.

vim +35 drivers/cpufreq/cpufreq_interactive.c

29  #include 
30  #include 
31  #include 
32  #include 
33  
34  #define CREATE_TRACE_POINTS
  > 35  #include 
36  
37  struct cpufreq_interactive_cpuinfo {
38  struct timer_list cpu_timer;

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH 2/2] arm: mm: support ARCH_MMAP_RND_BITS.

2015-10-28 Thread Daniel Cashman
From: dcashman 

arm: arch_mmap_rnd() uses a hard-coded value of 8 to generate the
random offset for the mmap base address.  This value represents a
compromise between increased ASLR effectiveness and avoiding
address-space fragmentation. Replace it with a Kconfig option, which
is sensibly bounded, so that platform developers may choose where to
place this compromise. Keep 8 as the minimum acceptable value.

Signed-off-by: Daniel Cashman 
---
 arch/arm/Kconfig   | 24 
 arch/arm/mm/mmap.c |  7 +--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 639411f..d61e7e2 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -306,6 +306,30 @@ config MMU
  Select if you want MMU-based virtualised addressing space
  support by paged memory management. If unsure, say 'Y'.
 
+config ARCH_MMAP_RND_BITS_MIN
+   int
+   default 8
+
+config ARCH_MMAP_RND_BITS_MAX
+   int
+   default 14 if MMU && PAGE_OFFSET=0x4000
+   default 15 if MMU && PAGE_OFFSET=0x8000
+   default 16 if MMU
+   default 8
+
+config ARCH_MMAP_RND_BITS
+   int "Number of bits to use for ASLR of mmap base address" if EXPERT
+   range ARCH_MMAP_RND_BITS_MIN ARCH_MMAP_RND_BITS_MAX
+   default ARCH_MMAP_RND_BITS_MIN
+   help
+ This value can be used to select the number of bits to use to
+ determine the random offset to the base address of vma regions
+ resulting from mmap allocations. This value will be bounded
+ by the architecture's minimum and maximum supported values.
+
+ This value can be changed after boot using the
+ /proc/sys/kernel/mmap_rnd_bits tunable
+
 #
 # The "ARM system type" choice list is ordered alphabetically by option
 # text.  Please add new entries in the option alphabetic order.
diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index 407dc78..73ca3a7 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -11,6 +11,10 @@
 #include 
 #include 
 
+int mmap_rnd_bits_min = CONFIG_ARCH_MMAP_RND_BITS_MIN;
+int mmap_rnd_bits_max = CONFIG_ARCH_MMAP_RND_BITS_MAX;
+int mmap_rnd_bits = CONFIG_ARCH_MMAP_RND_BITS;
+
 #define COLOUR_ALIGN(addr,pgoff)	\
	((((addr)+SHMLBA-1)&~(SHMLBA-1)) +	\
	 (((pgoff)<<PAGE_SHIFT) & (SHMLBA-1)))

[PATCH 1/2] mm: mmap: Add new /proc tunable for mmap_base ASLR.

2015-10-28 Thread Daniel Cashman
From: dcashman 

ASLR currently only uses 8 bits to generate the random offset for the
mmap base address on 32-bit architectures. This limit was chosen to keep
a poorly chosen offset from dividing the address space in a way that
prevents large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Signed-off-by: Daniel Cashman 
---
 Documentation/sysctl/kernel.txt | 14 ++
 include/linux/mm.h  |  6 ++
 kernel/sysctl.c | 11 +++
 3 files changed, 31 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 6fccb69..0d4ca53 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -41,6 +41,7 @@ show up in /proc/sys/kernel:
 - kptr_restrict
 - kstack_depth_to_print   [ X86 only ]
 - l2cr[ PPC only ]
+- mmap_rnd_bits
 - modprobe==> Documentation/debugging-modules.txt
 - modules_disabled
 - msg_next_id[ sysv ipc ]
@@ -391,6 +392,19 @@ This flag controls the L2 cache of G3 processor boards. If
 
 ==
 
+mmap_rnd_bits:
+
+This value can be used to select the number of bits to use to
+determine the random offset to the base address of vma regions
+resulting from mmap allocations on architectures which support
+tuning address space randomization.  This value will be bounded
+by the architecture's minimum and maximum supported values.
+
+This value can be changed after boot using the
+/proc/sys/kernel/mmap_rnd_bits tunable
+
+==
+
 modules_disabled:
 
 A toggle value indicating if modules are allowed to be loaded
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de..15b083a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -51,6 +51,12 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+#ifdef CONFIG_ARCH_MMAP_RND_BITS
+extern int mmap_rnd_bits_min;
+extern int mmap_rnd_bits_max;
+extern int mmap_rnd_bits;
+#endif
+
 #include 
 #include 
 #include 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d..37e657a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1139,6 +1139,17 @@ static struct ctl_table kern_table[] = {
.proc_handler   = timer_migration_handler,
},
 #endif
+#ifdef CONFIG_ARCH_MMAP_RND_BITS
+   {
+   .procname   = "mmap_rnd_bits",
+   .data   = &mmap_rnd_bits,
+   .maxlen = sizeof(mmap_rnd_bits),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &mmap_rnd_bits_min,
+   .extra2 = &mmap_rnd_bits_max,
+   },
+#endif
{ }
 };
 
-- 
2.6.0.rc2.230.g3dd15c0



[PATCH v2] arm64: dts: Added syscon-reboot node for FSL's LS2085A SoC

2015-10-28 Thread J. German Rivera
Added a syscon-reboot node to FSL's LS2085A SoC DT to leverage
the ARM-generic reboot mechanism for this SoC. This mechanism
is enabled through CONFIG_POWER_RESET_SYSCON.

Signed-off-by: J. German Rivera 
---
CHANGE HISTORY

Changes in v2:
- Address comment from Stuart Yoder:
  * Removed "@" from reboot node

 arch/arm64/boot/dts/freescale/fsl-ls2085a.dtsi | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls2085a.dtsi 
b/arch/arm64/boot/dts/freescale/fsl-ls2085a.dtsi
index e281ceb..8fb3646 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls2085a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls2085a.dtsi
@@ -131,6 +131,18 @@
interrupts = <1 9 0x4>;
};

+   rst_ccsr: rstccsr@1E6 {
+   compatible = "syscon";
+   reg = <0x0 0x1E6 0x0 0x1>;
+   };
+
+   reboot {
+   compatible = "syscon-reboot";
+   regmap = <&rst_ccsr>;
+   offset = <0x0>;
+   mask = <0x2>;
+   };
+
timer {
compatible = "arm,armv8-timer";
interrupts = <1 13 0x8>, /* Physical Secure PPI, active-low */
--
2.3.3



[PATCH] vfio: Include No-IOMMU mode

2015-10-28 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
---

This is pretty well the same as RFCv2, I've changed the pr_warn to a
dev_warn and added another, printing the pid and comm of the task when
it actually opens the device.  If Stephen can port the driver code
over and prove that this actually works sometime next week, and there
aren't any objections to this code, I'll include it in a pull request
for the next merge window.  MST, I dropped your ack due to the
changes, but I'll be happy to add it back if you like.  Thanks,

Alex

 drivers/vfio/Kconfig|   15 +++
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  186 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 209 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 4540179..b6d3cdc 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,5 +31,20 @@ menuconfig VFIO
 
  If you don't know what to do here, say N.
 
+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 964ad57..32b88bd 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;
 
-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;
 
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}
 
@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
 
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
-   iommu_group_put(pdev->dev.iommu_group);
+   vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);
 
if (vfio_pci_is_vga(pdev)) {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 1c0f98c..b0408be 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -62,6 +62,7 @@ struct vfio_container {
struct rw_semaphore group_lock;
struct vfio_iommu_driver*iommu_driver;
void*iommu_data;
+   boolnoiommu;
 };
 
 struct vfio_unbound_dev {
@@ -84,6 +85,7 @@ struct vfio_group {
struct list_headunbound_list;
struct mutex  

[PATCH 4/5] crypto: AES CBC by8 encryption

2015-10-28 Thread Tim Chen

This patch introduces the assembly routine to do a by8 AES CBC encryption
in support of the AES CBC multi-buffer implementation.

Encryption of 8 data streams of a key size are done simultaneously.

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S | 774 
 1 file changed, 774 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S 
b/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S
new file mode 100644
index 000..eaffc28
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S
@@ -0,0 +1,774 @@
+/*
+ * AES CBC by8 multibuffer optimization (x86_64)
+ * This file implements 128/192/256 bit AES CBC encryption
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#include 
+
+/* stack size needs to be an odd multiple of 8 for alignment */
+
+#define AES_KEYSIZE_128	16
+#define AES_KEYSIZE_192	24
+#define AES_KEYSIZE_256	32
+
+#define XMM_SAVE_SIZE  16*10
+#define GPR_SAVE_SIZE  8*9
+#define STACK_SIZE (XMM_SAVE_SIZE + GPR_SAVE_SIZE)
+
+#define GPR_SAVE_REG   %rsp
+#define GPR_SAVE_AREA  %rsp + XMM_SAVE_SIZE
+#define LEN_AREA_OFFSET	XMM_SAVE_SIZE + 8*8
+#define LEN_AREA_REG   %rsp
+#define LEN_AREA   %rsp + XMM_SAVE_SIZE + 8*8
+
+#define IN_OFFSET  0
+#define OUT_OFFSET 8*8
+#define KEYS_OFFSET	16*8
+#define IV_OFFSET  24*8
+
+
+#define IDX	%rax
+#define TMP	%rbx
+#define ARG	%rdi
+#define LEN	%rsi
+
+#define KEYS0  %r14
+#define KEYS1  %r15
+#define KEYS2  %rbp
+#define KEYS3  %rdx
+#define KEYS4  %rcx
+#define KEYS5  %r8
+#define KEYS6  %r9
+#define KEYS7  %r10
+
+#define IN0	%r11
+#define IN2	%r12
+#define IN4	%r13
+#define IN6	LEN
+
+#define XDATA0 %xmm0
+#define XDATA1 %xmm1
+#define XDATA2 %xmm2
+#define XDATA3 %xmm3
+#define XDATA4 %xmm4
+#define XDATA5 %xmm5
+#define XDATA6 %xmm6
+#define XDATA7 %xmm7
+
+#define XKEY0_3%xmm8
+#define XKEY1_4%xmm9
+#define XKEY2_5%xmm10
+#define XKEY3_6%xmm11
+#define XKEY4_7%xmm12
+#define XKEY5_8%xmm13
+#define XKEY6_9%xmm14
+#define XTMP   %xmm15
+
+#define MOVDQ	movdqu /* assume buffers not aligned */
+#define CONCAT(a, b)   a##b
+#define INPUT_REG_SUFX 1   /* IN */
+#define XDATA_REG_SUFX 2   /* XDAT */
+#define KEY_REG_SUFX   3   /* KEY */
+#define XMM_REG_SUFX   4   /* XMM */
+
+/*
+ * To avoid positional parameter errors while compiling
+ * three registers need to be passed
+ */
+.text
+
+.macro pxor2 x, y, z
+   MOVDQ   (\x,\y), XTMP
+   pxorXTMP, \z
+.endm
+
+.macro inreg n

[PATCH 5/5] crypto: AES CBC multi-buffer glue code

2015-10-28 Thread Tim Chen

This patch introduces the multi-buffer job manager which is responsible
for submitting scatter-gather buffers from several AES CBC jobs
to the multi-buffer algorithm. The glue code interfaces with the
underlying algorithm that handles 8 data streams of AES CBC encryption
in parallel. AES key expansion and CBC decryption requests are performed
in a manner similar to the existing AESNI Intel glue driver.

The outline of the algorithm for AES CBC encryption requests is
sketched below:

Any driver requesting the crypto service will place an async crypto
request on the workqueue.  The multi-buffer crypto daemon will pull an
AES CBC encryption request from the work queue and put each request in an
empty data lane for multi-buffer crypto computation.  When all the empty
lanes are filled, computation will commence on the jobs in parallel and
the job with the shortest remaining buffer will get completed and be
returned. To prevent a prolonged stall when no new jobs arrive, we will
flush the workqueue of jobs after a maximum allowable delay has elapsed.

To accommodate the fragmented nature of scatter-gather, we will keep
submitting the next scatter-buffer fragment for a job for multi-buffer
computation until a job is completed and no more buffer fragments remain.
At that time we will pull a new job to fill the now empty data slot.
We check with the multibuffer scheduler to see if there are other
completed jobs to prevent extraneous delay in returning any completed
jobs.

This multi-buffer algorithm should be used for cases where we get at
least 8 streams of crypto jobs submitted at a reasonably high rate.
For a low crypto job submission rate and a low number of data streams,
this algorithm will not be beneficial. The reason is that at a low rate
we end up flushing the jobs before the data lanes are filled, instead of
processing them with all the data lanes full.  We miss the benefit of
parallel computation while adding delay to the processing of each crypto
job.  Some tuning of the maximum latency parameter may be needed
to get the best performance.

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/Makefile|   1 +
 arch/x86/crypto/aes-cbc-mb/Makefile |  22 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c | 815 
 3 files changed, 838 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/Makefile
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b9b912a..000db49 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
+obj-$(CONFIG_CRYPTO_AES_CBC_MB) += aes-cbc-mb/
 obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 
 # These modules require assembler to support AVX.
diff --git a/arch/x86/crypto/aes-cbc-mb/Makefile 
b/arch/x86/crypto/aes-cbc-mb/Makefile
new file mode 100644
index 000..b642bd8
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/Makefile
@@ -0,0 +1,22 @@
+#
+# Arch-specific CryptoAPI modules.
+#
+
+avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
+
+# we need decryption and key expansion routine symbols
+# if either AESNI_NI_INTEL or AES_CBC_MB is a module
+
+ifeq ($(CONFIG_CRYPTO_AES_NI_INTEL),m)
+   dec_support := ../aesni-intel_asm.o
+endif
+ifeq ($(CONFIG_CRYPTO_AES_CBC_MB),m)
+   dec_support := ../aesni-intel_asm.o
+endif
+
+ifeq ($(avx_supported),yes)
+   obj-$(CONFIG_CRYPTO_AES_CBC_MB) += aes-cbc-mb.o
+   aes-cbc-mb-y := $(dec_support) aes_cbc_mb.o aes_mb_mgr_init.o \
+   mb_mgr_inorder_x8_asm.o mb_mgr_ooo_x8_asm.o \
+   aes_cbc_enc_x8.o
+endif
diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c 
b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c
new file mode 100644
index 000..037d4e8
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c
@@ -0,0 +1,815 @@
+/*
+ * Multi buffer AES CBC algorithm glue code
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt

[PATCH 3/5] crypto: AES CBC multi-buffer scheduler

2015-10-28 Thread Tim Chen

This patch implements in-order scheduler for encrypting multiple buffers
in parallel supporting AES CBC encryption with key sizes of
128, 192 and 256 bits. It uses 8 data lanes by taking advantage of the
SIMD instructions with AVX2 registers.

The multibuffer manager and scheduler are mostly written in assembly, and
the initialization support is written in C. The AES CBC multibuffer crypto
driver support interfaces with the multibuffer manager and scheduler
to support AES CBC encryption in parallel. The scheduler supports
job submissions, job flushing and job retrievals after completion.

The basic flow of usage of the CBC multibuffer scheduler is as follows:

- The caller allocates an aes_cbc_mb_mgr_inorder_x8 object
and initializes it once by calling aes_cbc_init_mb_mgr_inorder_x8().

- The aes_cbc_mb_mgr_inorder_x8 structure has an array of JOB_AES
objects. Allocation and scheduling of JOB_AES objects are managed
by the multibuffer scheduler support routines. The caller allocates
a JOB_AES using aes_cbc_get_next_job_inorder_x8().

- The returned JOB_AES must be filled in with parameters for CBC
encryption (eg: plaintext buffer, ciphertext buffer, key, iv, etc) and
submitted to the manager object using aes_cbc_submit_job_inorder_xx().

- If the oldest JOB_AES is completed during a call to
aes_cbc_submit_job_inorder_x8(), it is returned. Otherwise,
NULL is returned.

- A call to aes_cbc_flush_job_inorder_x8() always returns the
oldest job, unless the multibuffer manager is empty of jobs.

- A call to aes_cbc_get_completed_job_inorder_x8() returns
a completed job. This routine is useful to process completed
jobs instead of waiting for the flusher to engage.

- When a job is returned from submit or flush, the caller extracts
the useful data and returns it to the multibuffer manager implicitly
by the next call to aes_cbc_get_next_job_xx().

Jobs are always returned from submit or flush routines in the order they
were submitted (hence "inorder").  A job allocated using
aes_cbc_get_next_job_inorder_x8() must be filled in and submitted before
another call. A job returned by aes_cbc_submit_job_inorder_x8() or
aes_cbc_flush_job_inorder_x8() is 'deallocated' upon the next call to
get a job structure. Calls to get_next_job() cannot fail. If all jobs are
allocated after a call to get_next_job(), the subsequent call to submit
always returns the oldest job in a completed state.

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c   | 145 +++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S | 222 +++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S | 416 +
 3 files changed, 783 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c 
b/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
new file mode 100644
index 000..7a7f8a1
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
@@ -0,0 +1,145 @@
+/*
+ * Initialization code for multi buffer AES CBC algorithm
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOS

[PATCH 1/5] crypto: Multi-buffer encryption infrastructure support

2015-10-28 Thread Tim Chen

In this patch, the infrastructure needed in support of multibuffer
encryption implementation is added:

a) Enhance the mcryptd daemon to support blkcipher requests.

b) Update configuration to include multi-buffer encryption build support.

c) Add support to the crypto scatterwalk so it can sleep during an
encryption operation, as we may have buffers for jobs in data lanes that
are half-finished, waiting for additional jobs to arrive and fill the
empty lanes before we start the encryption again.  Therefore, we enhance
the crypto walk with the option to map data buffers non-atomically.  This
is only done by algorithms run from the crypto daemon, which knows it is
safe to do so because it can save and restore FPU state in the correct context.

For an introduction to the multi-buffer implementation, please see
http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 crypto/Kconfig   |  16 +++
 crypto/blkcipher.c   |  29 -
 crypto/mcryptd.c | 256 ++-
 crypto/scatterwalk.c |   7 ++
 include/crypto/algapi.h  |   1 +
 include/crypto/mcryptd.h |  36 ++
 include/crypto/scatterwalk.h |   6 +
 include/linux/crypto.h   |   1 +
 8 files changed, 347 insertions(+), 5 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 7240821..6b51084 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -888,6 +888,22 @@ config CRYPTO_AES_NI_INTEL
  ECB, CBC, LRW, PCBC, XTS. The 64 bit version has additional
  acceleration for CTR.
 
+config CRYPTO_AES_CBC_MB
+   tristate "AES CBC algorithm (x86_64 Multi-Buffer, Experimental)"
+   depends on X86 && 64BIT
+   select CRYPTO_ABLK_HELPER
+   select CRYPTO_MCRYPTD
+   help
+ AES CBC encryption implemented using multi-buffer technique.
+ This algorithm computes on multiple data lanes concurrently with
+ SIMD instructions for better throughput.  It should only be
+ used when there is significant work to generate many separate
+ crypto requests that keep all the data lanes filled to get
+ the performance benefit.  If the data lanes are unfilled, a
+ flush operation will be initiated after some delay to process
+ the existing crypto jobs, adding some extra latency at low
+ load case.
+
 config CRYPTO_AES_SPARC64
tristate "AES cipher algorithms (SPARC64)"
depends on SPARC64
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 11b9814..9fd4028 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -35,6 +35,9 @@ enum {
BLKCIPHER_WALK_SLOW = 1 << 1,
BLKCIPHER_WALK_COPY = 1 << 2,
BLKCIPHER_WALK_DIFF = 1 << 3,
+   /* deal with scenarios where we can sleep during sg walk */
+   /* when we process part of a request */
+   BLKCIPHER_WALK_MAY_SLEEP = 1 << 4,
 };
 
 static int blkcipher_walk_next(struct blkcipher_desc *desc,
@@ -44,22 +47,38 @@ static int blkcipher_walk_first(struct blkcipher_desc *desc,
 
 static inline void blkcipher_map_src(struct blkcipher_walk *walk)
 {
-   walk->src.virt.addr = scatterwalk_map(&walk->in);
+   /* add support for asynchronous requests which need no atomic map */
+   if (walk->flags & BLKCIPHER_WALK_MAY_SLEEP)
+   walk->src.virt.addr = scatterwalk_map_nonatomic(&walk->in);
+   else
+   walk->src.virt.addr = scatterwalk_map(&walk->in);
 }
 
 static inline void blkcipher_map_dst(struct blkcipher_walk *walk)
 {
-   walk->dst.virt.addr = scatterwalk_map(&walk->out);
+   /* add support for asynchronous requests which need no atomic map */
+   if (walk->flags & BLKCIPHER_WALK_MAY_SLEEP)
+   walk->dst.virt.addr = scatterwalk_map_nonatomic(&walk->out);
+   else
+   walk->dst.virt.addr = scatterwalk_map(&walk->out);
 }
 
 static inline void blkcipher_unmap_src(struct blkcipher_walk *walk)
 {
-   scatterwalk_unmap(walk->src.virt.addr);
+   /* add support for asynchronous requests which need no atomic map */
+   if (walk->flags & BLKCIPHER_WALK_MAY_SLEEP)
+   scatterwalk_unmap_nonatomic(walk->src.virt.addr);
+   else
+   scatterwalk_unmap(walk->src.virt.addr);
 }
 
 static inline void blkcipher_unmap_dst(struct blkcipher_walk *walk)
 {
-   scatterwalk_unmap(walk->dst.virt.addr);
+   /* add support for asynchronous requests which need no atomic map */
+   if (walk->flags & BLKCIPHER_WALK_MAY_SLEEP)
+   scatterwalk_unmap_nonatomic(walk->dst.virt.addr);
+   else
+   scatterwalk_unmap(walk->dst.virt.addr);
 }
 
 /* Get a spot of the specified length that does not straddle a page.
@@ -299,6 +318,8 @@ static inline int blkcipher_copy_iv(struct blkcipher_walk 
*walk)
 int blkcipher_walk_virt(struct blkcipher_desc *desc,
struct blkcipher_walk *walk)
 {
+   

[PATCH 2/5] crypto: AES CBC multi-buffer data structures

2015-10-28 Thread Tim Chen

This patch introduces the data structures and prototypes of functions
needed for doing AES CBC encryption using multi-buffer. Included are
the structures of the multi-buffer AES CBC job, job scheduler in C and
data structure defines in x86 assembly code.

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h|  96 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h| 131 
 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S | 270 +
 arch/x86/crypto/aes-cbc-mb/reg_sizes.S | 125 
 4 files changed, 622 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/reg_sizes.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h 
b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
new file mode 100644
index 000..5493f83
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
@@ -0,0 +1,96 @@
+/*
+ * Header file for multi buffer AES CBC algorithm manager
+ * that deals with 8 buffers at a time
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#ifndef __AES_CBC_MB_CTX_H
+#define __AES_CBC_MB_CTX_H
+
+
+#include 
+
+#include "aes_cbc_mb_mgr.h"
+
+#define CBC_ENCRYPT0x01
+#define CBC_DECRYPT0x02
+#define CBC_START  0x04
+#define CBC_DONE   0x08
+
+#define CBC_CTX_STS_IDLE   0x00
+#define CBC_CTX_STS_PROCESSING 0x01
+#define CBC_CTX_STS_LAST   0x02
+#define CBC_CTX_STS_COMPLETE   0x04
+
+enum cbc_ctx_error {
+   CBC_CTX_ERROR_NONE   =  0,
+   CBC_CTX_ERROR_INVALID_FLAGS  = -1,
+   CBC_CTX_ERROR_ALREADY_PROCESSING = -2,
+   CBC_CTX_ERROR_ALREADY_COMPLETED  = -3,
+};
+
+#define cbc_ctx_init(ctx, nbytes, op) \
+   do { \
+   (ctx)->flag = (op) | CBC_START; \
+   (ctx)->nbytes = nbytes; \
+   } while (0)
+
+/* AESNI routines to perform cbc decrypt and key expansion */
+
+asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len, u8 *iv);
+asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
+unsigned int key_len);
+
+#endif /* __AES_CBC_MB_CTX_H */
diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h 
b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
new file mode 100644
index 000..0def82e
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
@@ -0,0 +1,131 @@
+/*
+ * Header file for multi buffer AES CBC algorithm manager
+ *

[PATCH 0/5] x86 AES-CBC encryption with AVX2 multi-buffer

2015-10-28 Thread Tim Chen

In this patch series, we introduce AES CBC encryption that is parallelized
on x86_64 CPUs with AVX2. The multi-buffer technique takes advantage
of the wide AVX2 registers and encrypts 8 data streams in parallel with SIMD
instructions. Decryption is handled as in the existing AESNI Intel CBC
implementation, which can already parallelize decryption even for a single
data stream.

Please see the multi-buffer whitepaper for details of the technique:
http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html

It is important that any driver using this algorithm use it only for
scenarios where there are many data streams that can keep the data lanes
filled most of the time.  It should not be used when mostly a single data
stream is expected. Otherwise we may incur extra delays when there are
frequent gaps in the data lanes, causing us to wait for incoming data to
fill the lanes before initiating encryption.  We may also have to wait for
a flush operation to commence when no new data arrives within the wait
time. However, we keep this extra delay to a minimum by opportunistically
flushing the unfinished jobs if the crypto daemon is the only active task
running on a cpu.

By using this technique, we saw a throughput increase of up to
5.8x under optimal conditions when we have fully loaded encryption jobs
filling up all the data lanes.

Tim Chen (5):
  crypto: Multi-buffer encryption infrastructure support
  crypto: AES CBC multi-buffer data structures
  crypto: AES CBC multi-buffer scheduler
  crypto: AES CBC by8 encryption
  crypto: AES CBC multi-buffer glue code

 arch/x86/crypto/Makefile   |   1 +
 arch/x86/crypto/aes-cbc-mb/Makefile|  22 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S| 774 +++
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c| 815 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h|  96 +++
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h| 131 
 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c   | 145 
 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S | 270 +++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S | 222 ++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S | 416 +++
 arch/x86/crypto/aes-cbc-mb/reg_sizes.S | 125 
 crypto/Kconfig |  16 +
 crypto/blkcipher.c |  29 +-
 crypto/mcryptd.c   | 256 ++-
 crypto/scatterwalk.c   |   7 +
 include/crypto/algapi.h|   1 +
 include/crypto/mcryptd.h   |  36 +
 include/crypto/scatterwalk.h   |   6 +
 include/linux/crypto.h |   1 +
 19 files changed, 3364 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/Makefile
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/reg_sizes.S

-- 
1.7.11.7


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [alsa-devel] [PATCH V2 02/10] ASoC: img: Add driver for I2S input controller

2015-10-28 Thread Damien Horsley
On 28/10/15 01:04, Mark Brown wrote:
> On Tue, Oct 27, 2015 at 01:55:27PM +, Damien Horsley wrote:
>> On 23/10/15 23:57, Mark Brown wrote:
> 
>>> Shouldn't we be doing that flush on stream close instead?  If nothing
>>> else the flush is going to discard a bit of data if the stream is just
>>> paused.
> 
>> The FIFOs are only 8 frames in size, so I am not sure there is an
>> issue with these frames being lost.
> 
>> I think it also makes sense to keep the blocks consistent with each
>> other. The spdif (out and in), and parallel out, all flush automatically
>> when stopped, and the fifo for the i2s out block is cleared when the
>> reset is asserted.
> 
> This seems like an issue that got missed in the other drivers then.  I'd
> expect the trigger operation to be a minimal operation which starts and
> stops the data transfer, not doing anything else.
> 

The spdif out, spdif in, and parallel out blocks auto-flush whenever
they are stopped. It is not possible for software to prevent this behavior.


Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults

2015-10-28 Thread Mike Kravetz
On 10/28/2015 02:00 PM, Hugh Dickins wrote:
> On Wed, 28 Oct 2015, Mike Kravetz wrote:
>> On 10/27/2015 08:34 PM, Hugh Dickins wrote:
>>
>> Thanks for the detailed response Hugh.  I will try to address your questions
>> and provide more reasoning behind the use case and need for this code.
> 
> And thank you for your detailed response, Mike: that helped a lot.
> 
>> Ok, here is a bit more explanation of the proposed use case.  It all
>> revolves around a DB's use of hugetlbfs and the desire for more control
>> over the underlying memory.  This additional control is achieved by
>> adding existing fallocate and userfaultfd semantics to hugetlbfs.
>>
>> In this use case there is a single process that manages hugetlbfs files
>> and the underlying memory resources.  It pre-allocates/initializes these
>> files.
>>
>> In addition, there are many other processes which access (rw mode) these
>> files.  They will simply mmap the files.  It is expected that they will
>> not fault in any new pages.  Rather, all pages would have been pre-allocated
>> by the management process.
>>
>> At some time, the management process determines that specific ranges of
>> pages within the hugetlbfs files are no longer needed.  It will then punch
>> holes in the files.  These 'free' pages within the holes may then be used
>> for other purposes.  For applications like this (sophisticated DBs), huge
>> pages are reserved at system init time and closely managed by the
>> application.
>> Hence, the desire for this additional control.
>>
>> So, when a hole containing N huge pages is punched, the management process
>> wants to know that it really has N huge pages for other purposes.  Ideally,
>> none of the other processes mapping this file/area would access the hole.
>> This is an application error, and it can be 'caught' with  userfaultfd.
>>
>> Since these other (non-management) processes will never fault in pages,
>> they would simply set up userfaultfd to catch any page faults immediately
>> after mmaping the hugetlbfs file.
>>
>>>
>>> But it sounds to me more as if the holes you want punched are not
>>> quite like on other filesystems, and you want to be able to police
>>> them afterwards with userfaultfd, to prevent them from being refilled.
>>
>> I am not sure if they are any different.
>>
>> One could argue that a hole punch operation must always result in all
>> pages within the hole being deallocated.  As you point out, this could
>> race with a fault.  Previously, there would be no way to determine if
>> all pages had been deallocated because user space could not detect this
>> race.  Now, userfaultfd allows user space to catch page faults.  So,
>> it is now possible to catch/depend on hole punch deallocating all pages
>> within the hole.
>>
>>>
>>> Can't userfaultfd be used just slightly earlier, to prevent them from
>>> being filled while doing the holepunch?  Then no need for this patchset?
>>
>> I do not think so, at least with current userfaultfd semantics.  The hole
>> needs to be punched before being caught with UFFDIO_REGISTER_MODE_MISSING.
> 
> Great, that makes sense.
> 
> I was worried that you needed some kind of atomic treatment of the whole
> extent punched, but all you need is to close the hole/fault race one
> hugepage at a time.
> 
> Throw away all of 1/4, 2/4, 3/4: I think all you need is your 4/4
> (plus i_mmap_lock_write around the hugetlb_vmdelete_list of course).
> 
> There you already do the single hugepage hugetlb_vmdelete_list()
> under mutex_lock(&hugetlb_fault_mutex_table[hash]).
> 
> And it should come as no surprise that hugetlb_fault() does most
> of its work under that same mutex.
> 
> So once remove_inode_hugepages() unlocks the mutex, that page is gone
> from the file, and userfaultfd UFFDIO_REGISTER_MODE_MISSING will do
> what you want, won't it?
> 
> I don't think "my" code buys you anything at all: you're not in danger of
> shmem's starvation livelock issue, partly because remove_inode_hugepages()
> uses the simple loop from start to end, and partly because hugetlb_fault()
> already takes the serializing mutex (no equivalent in shmem_fault()).
> 
> Or am I dreaming?

I don't think you are dreaming.

I should have stepped back and thought about this more before pulling
in the shmem code.  It really is only a 'page at a time' operation, and we
can use the fault mutex table for that.

I'll code it up with just the changes needed for 4/4 and put it through some
stress testing.

Thanks,
-- 
Mike Kravetz

> 
> Hugh
> 




Re: [PATCH] arm: dts: qcom: Add board clocks

2015-10-28 Thread Andy Gross
On Mon, Oct 26, 2015 at 06:26:53PM -0700, Stephen Boyd wrote:
> These clocks are fixed rate board sources that should be in DT.
> Add them.
> 
> Cc: Georgi Djakov 
> Signed-off-by: Stephen Boyd 
> ---

Reviewed-by: Andy Gross 

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v5 1/3] USB: serial: cp210x: Workaround cp2108 Tx queue bug

2015-10-28 Thread Konstantin Shkolnyy
Occasionally, writing data and immediately closing the port makes cp2108
stop responding. The device has to be unplugged to clear the error.
The failure is induced by shutting down the device while its Tx queue
still has unsent data. This condition is avoided by issuing a PURGE command
from the close() callback.

This change is applied to all cp210x devices. Clearing internal queues on
close is generally good.

Signed-off-by: Konstantin Shkolnyy 
---
 drivers/usb/serial/cp210x.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index eac7cca..8ba1005 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -301,6 +301,14 @@ static struct usb_serial_driver * const serial_drivers[] = 
{
 #define CONTROL_WRITE_RTS  0x0200
 
 /*
+ * CP210X_PURGE - 16 bits passed in wValue of USB request.
+ * SiLabs app note AN571 gives a strange description of the 4 bits:
+ * bit 0 or bit 2 clears the transmit queue and 1 or 3 receive.
+ * writing 1 to all, however, purges cp2108 well enough to avoid the hang.
+ */
+#define PURGE_ALL  0x000f
+
+/*
  * cp210x_get_config
  * Reads from the CP210x configuration registers
  * 'size' is specified in bytes.
@@ -475,7 +483,14 @@ static int cp210x_open(struct tty_struct *tty, struct 
usb_serial_port *port)
 
 static void cp210x_close(struct usb_serial_port *port)
 {
+   unsigned int purge_ctl;
+
usb_serial_generic_close(port);
+
+   /* Clear both queues; cp2108 needs this to avoid an occasional hang */
+   purge_ctl = PURGE_ALL;
+   cp210x_set_config(port, CP210X_PURGE, &purge_ctl, 2);
+
cp210x_set_config_single(port, CP210X_IFC_ENABLE, UART_DISABLE);
 }
 
-- 
1.8.4.5



[PATCH v5 3/3] USB: serial: cp210x: Workaround cp2108 GET_LINE_CTL bug

2015-10-28 Thread Konstantin Shkolnyy
cp2108 GET_LINE_CTL returns the 16-bit value with the 2 bytes swapped.
However, SET_LINE_CTL functions properly. When the driver tries to modify
the register, it reads it, modifies some bits, and writes the result back.
Because the read bytes were swapped, this often results in an invalid
value being written. In turn, this causes cp2108 to respond with a stall.
The stall sometimes doesn't clear properly, and cp2108 starts responding
to subsequent valid commands with stalls as well, effectively failing.

Signed-off-by: Konstantin Shkolnyy 
---
 drivers/usb/serial/cp210x.c | 71 +
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 352fe63..e91b654 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -199,6 +199,7 @@ MODULE_DEVICE_TABLE(usb, id_table);
 
 struct cp210x_port_private {
__u8bInterfaceNumber;
+   boolhas_swapped_line_ctl;
 };
 
 static struct usb_serial_driver cp210x_device = {
@@ -419,6 +420,61 @@ static inline int cp210x_set_config_single(struct 
usb_serial_port *port,
 }
 
 /*
+ * Detect CP2108 GET_LINE_CTL bug and activate workaround.
+ * Write a known good value 0x800, read it back.
+ * If it comes back swapped the bug is detected.
+ * Preserve the original register value.
+ */
+static int cp210x_detect_swapped_line_ctl(struct usb_serial_port *port)
+{
+   struct cp210x_port_private *port_priv = usb_get_serial_port_data(port);
+   unsigned int line_ctl_save;
+   unsigned int line_ctl_test;
+   int err;
+
+   err = cp210x_get_config(port, CP210X_GET_LINE_CTL, &line_ctl_save, 2);
+   if (err)
+   return err;
+
+   line_ctl_test = 0x800;
+   err = cp210x_set_config(port, CP210X_SET_LINE_CTL, &line_ctl_test, 2);
+   if (err)
+   return err;
+
+   err = cp210x_get_config(port, CP210X_GET_LINE_CTL, &line_ctl_test, 2);
+   if (err)
+   return err;
+
+   /* has_swapped_line_ctl is 0 here because port_priv was kzalloced */
+   if (line_ctl_test == 8) {
+   port_priv->has_swapped_line_ctl = true;
+   line_ctl_save = swab16((u16)line_ctl_save);
+   }
+
+   return cp210x_set_config(port, CP210X_SET_LINE_CTL, &line_ctl_save, 2);
+}
+
+/*
+ * Must always be called instead of cp210x_get_config(CP210X_GET_LINE_CTL)
+ * to workaround cp2108 bug and get correct value.
+ */
+static int cp210x_get_line_ctl(struct usb_serial_port *port, unsigned int *ctl)
+{
+   struct cp210x_port_private *port_priv = usb_get_serial_port_data(port);
+   int err;
+
+   err = cp210x_get_config(port, CP210X_GET_LINE_CTL, ctl, 2);
+   if (err)
+   return err;
+
+   /* Workaround swapped bytes in 16-bit value from CP210X_GET_LINE_CTL */
+   if (port_priv->has_swapped_line_ctl)
+   *ctl = swab16((u16)(*ctl));
+
+   return 0;
+}
+
+/*
  * cp210x_quantise_baudrate
  * Quantises the baud rate as per AN205 Table 1
  */
@@ -535,7 +591,7 @@ static void cp210x_get_termios_port(struct usb_serial_port 
*port,
 
cflag = *cflagp;
 
-   cp210x_get_config(port, CP210X_GET_LINE_CTL, &bits, 2);
+   cp210x_get_line_ctl(port, &bits);
cflag &= ~CSIZE;
switch (bits & BITS_DATA_MASK) {
case BITS_DATA_5:
@@ -703,7 +759,7 @@ static void cp210x_set_termios(struct tty_struct *tty,
 
/* If the number of data bits is to be updated */
if ((cflag & CSIZE) != (old_cflag & CSIZE)) {
-   cp210x_get_config(port, CP210X_GET_LINE_CTL, &bits, 2);
+   cp210x_get_line_ctl(port, &bits);
bits &= ~BITS_DATA_MASK;
switch (cflag & CSIZE) {
case CS5:
@@ -737,7 +793,7 @@ static void cp210x_set_termios(struct tty_struct *tty,
 
if ((cflag & (PARENB|PARODD|CMSPAR)) !=
(old_cflag & (PARENB|PARODD|CMSPAR))) {
-   cp210x_get_config(port, CP210X_GET_LINE_CTL, &bits, 2);
+   cp210x_get_line_ctl(port, &bits);
bits &= ~BITS_PARITY_MASK;
if (cflag & PARENB) {
if (cflag & CMSPAR) {
@@ -763,7 +819,7 @@ static void cp210x_set_termios(struct tty_struct *tty,
}
 
if ((cflag & CSTOPB) != (old_cflag & CSTOPB)) {
-   cp210x_get_config(port, CP210X_GET_LINE_CTL, &bits, 2);
+   cp210x_get_line_ctl(port, &bits);
bits &= ~BITS_STOP_MASK;
if (cflag & CSTOPB) {
bits |= BITS_STOP_2;
@@ -883,6 +939,7 @@ static int cp210x_port_probe(struct usb_serial_port *port)
struct usb_serial *serial = port->serial;
struct usb_host_interface *cur_altsetting;
struct cp210x_port_private *port_priv;
+   int err;
 
port_priv = kzalloc(sizeof(*port_priv), GFP_KERNEL);
if (!port_priv)
@@ -893,7 +950,11 @@ static int

[PATCH 1/2] ARM: dts: omap4: Add elm node

2015-10-28 Thread Franklin S Cooper Jr
Add device tree entry for the error location module.

Signed-off-by: Franklin S Cooper Jr 
---
 arch/arm/boot/dts/omap4.dtsi | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/arm/boot/dts/omap4.dtsi b/arch/arm/boot/dts/omap4.dtsi
index 5a206c1..a40eb23 100644
--- a/arch/arm/boot/dts/omap4.dtsi
+++ b/arch/arm/boot/dts/omap4.dtsi
@@ -348,6 +348,14 @@
#interrupt-cells = <2>;
};
 
+   elm: elm@48078000 {
+   compatible = "ti,am3352-elm";
+   reg = <0x48078000 0x2000>;
+   interrupts = <4>;
+   ti,hwmods = "elm";
+   status = "disabled";
+   };
+
gpmc: gpmc@5000 {
compatible = "ti,omap4430-gpmc";
reg = <0x5000 0x1000>;
-- 
2.6.1



[tip:core/efi] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

2015-10-28 Thread tip-bot for Taku Izumi
Commit-ID:  78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Gitweb: http://git.kernel.org/tip/78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Author: Taku Izumi 
AuthorDate: Fri, 23 Oct 2015 11:48:17 +0200
Committer:  Ingo Molnar 
CommitDate: Wed, 28 Oct 2015 12:28:06 +0100

efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

Commit:

  0f96a99dab36 ("efi: Add "efi_fake_mem" boot option")

introduced the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from integer 
of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*,
causing an int-to-pointer-cast warning on x86 32-bit builds.
However, since the void* type is inappropriate for a physical
address, the definition of struct efi_memory_map::phys_map has
been changed to phys_addr_t in the previous patch, and so the
cast can be dropped entirely.

This patch also changes the type of the "new_memmap_phy"
variable from "u64" to "phys_addr_t" to align with the types of
memblock_alloc() and struct efi_memory_map::phys_map.

Reported-by: Ingo Molnar 
Signed-off-by: Taku Izumi 
[ Removed void* cast, updated commit log]
Signed-off-by: Ard Biesheuvel 
Reviewed-by: Matt Fleming 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: linux-...@vger.kernel.org
Cc: matt.flem...@intel.com
Link: 
http://lkml.kernel.org/r/1445593697-1342-2-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/fake_mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..ed3a854 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
u64 start, end, m_start, m_end, m_attr;
int new_nr_map = memmap.nr_map;
efi_memory_desc_t *md;
-   u64 new_memmap_phy;
+   phys_addr_t new_memmap_phy;
void *new_memmap;
void *old, *new;
int i;
@@ -183,7 +183,7 @@ void __init efi_fake_memmap(void)
/* swap into new EFI memmap */
efi_unmap_memmap();
memmap.map = new_memmap;
-   memmap.phys_map = (void *)new_memmap_phy;
+   memmap.phys_map = new_memmap_phy;
memmap.nr_map = new_nr_map;
memmap.map_end = memmap.map + memmap.nr_map * memmap.desc_size;
set_bit(EFI_MEMMAP, &efi.flags);


[tip:core/efi] efi: Use correct type for struct efi_memory_map::phys_map

2015-10-28 Thread tip-bot for Ard Biesheuvel
Commit-ID:  44511fb9e55ada760822b0b0d7be9d150576f17f
Gitweb: http://git.kernel.org/tip/44511fb9e55ada760822b0b0d7be9d150576f17f
Author: Ard Biesheuvel 
AuthorDate: Fri, 23 Oct 2015 11:48:16 +0200
Committer:  Ingo Molnar 
CommitDate: Wed, 28 Oct 2015 12:28:06 +0100

efi: Use correct type for struct efi_memory_map::phys_map

We have been getting away with using a void* for the physical
address of the UEFI memory map, since, even on 32-bit platforms
with 64-bit physical addresses, no truncation takes place if the
memory map has been allocated by the firmware (which only uses
1:1 virtually addressable memory), which is usually the case.

However, commit:

  0f96a99dab36 ("efi: Add "efi_fake_mem" boot option")

adds code that clones and modifies the UEFI memory map, and the
clone may live above 4 GB on 32-bit platforms.

This means our use of void* for struct efi_memory_map::phys_map has
graduated from 'incorrect but working' to 'incorrect and
broken', and we need to fix it.

So redefine struct efi_memory_map::phys_map as phys_addr_t, and
get rid of a bunch of casts that are now unneeded.

Signed-off-by: Ard Biesheuvel 
Reviewed-by: Matt Fleming 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: izumi.t...@jp.fujitsu.com
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: linux-...@vger.kernel.org
Cc: matt.flem...@intel.com
Link: http://lkml.kernel.org/r/1445593697-1342-1-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 arch/arm64/kernel/efi.c | 4 ++--
 arch/x86/platform/efi/efi.c | 4 ++--
 drivers/firmware/efi/efi.c  | 8 
 include/linux/efi.h | 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 4b7df34..61eb1d1 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -208,7 +208,7 @@ void __init efi_init(void)
 
memblock_reserve(params.mmap & PAGE_MASK,
 PAGE_ALIGN(params.mmap_size + (params.mmap & 
~PAGE_MASK)));
-   memmap.phys_map = (void *)params.mmap;
+   memmap.phys_map = params.mmap;
memmap.map = early_memremap(params.mmap, params.mmap_size);
memmap.map_end = memmap.map + params.mmap_size;
memmap.desc_size = params.desc_size;
@@ -282,7 +282,7 @@ static int __init arm64_enable_runtime_services(void)
pr_info("Remapping and enabling EFI services.\n");
 
mapsize = memmap.map_end - memmap.map;
-   memmap.map = (__force void *)ioremap_cache((phys_addr_t)memmap.phys_map,
+   memmap.map = (__force void *)ioremap_cache(memmap.phys_map,
   mapsize);
if (!memmap.map) {
pr_err("Failed to remap EFI memory map\n");
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 3e1d09e..ad28540 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -194,7 +194,7 @@ static void __init do_add_efi_memmap(void)
 int __init efi_memblock_x86_reserve_range(void)
 {
struct efi_info *e = &boot_params.efi_info;
-   unsigned long pmap;
+   phys_addr_t pmap;
 
if (efi_enabled(EFI_PARAVIRT))
return 0;
@@ -209,7 +209,7 @@ int __init efi_memblock_x86_reserve_range(void)
 #else
pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32));
 #endif
-   memmap.phys_map = (void *)pmap;
+   memmap.phys_map = pmap;
memmap.nr_map   = e->efi_memmap_size /
  e->efi_memdesc_size;
memmap.desc_size= e->efi_memdesc_size;
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 31fc864..027ca21 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -254,7 +254,7 @@ subsys_initcall(efisubsys_init);
 int __init efi_mem_desc_lookup(u64 phys_addr, efi_memory_desc_t *out_md)
 {
struct efi_memory_map *map = efi.memmap;
-   void *p, *e;
+   phys_addr_t p, e;
 
if (!efi_enabled(EFI_MEMMAP)) {
pr_err_once("EFI_MEMMAP is not enabled.\n");
@@ -286,10 +286,10 @@ int __init efi_mem_desc_lookup(u64 phys_addr, 
efi_memory_desc_t *out_md)
 * So just always get our own virtual map on the CPU.
 *
 */
-   md = early_memremap((phys_addr_t)p, sizeof (*md));
+   md = early_memremap(p, sizeof (*md));
if (!md) {
-   pr_err_once("early_memremap(%p, %zu) failed.\n",
-   p, sizeof (*md));
+   pr_err_once("early_memremap(%pa, %zu) failed.\n",
+   &p, sizeof (*md));
return -ENOMEM;
}
 
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 4d01c10..569b5a8 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -680,7 +680,7 @@ typedef struct {
 } efi_system_table_t;
 
 struct efi_memory_map {
- 

[PATCH 2/2] ARM: omap4: hwmod: Remove elm address space from hwmod data

2015-10-28 Thread Franklin S Cooper Jr
ELM address information is provided by device tree. No longer need
to include this information within hwmod.

This patch has only been boot tested.

Signed-off-by: Franklin S Cooper Jr 
---
 arch/arm/mach-omap2/omap_hwmod_44xx_data.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/arm/mach-omap2/omap_hwmod_44xx_data.c 
b/arch/arm/mach-omap2/omap_hwmod_44xx_data.c
index 43eebf2..8f13f4a 100644
--- a/arch/arm/mach-omap2/omap_hwmod_44xx_data.c
+++ b/arch/arm/mach-omap2/omap_hwmod_44xx_data.c
@@ -3915,21 +3915,11 @@ static struct omap_hwmod_ocp_if 
omap44xx_l4_per__dss_venc = {
.user   = OCP_USER_MPU,
 };
 
-static struct omap_hwmod_addr_space omap44xx_elm_addrs[] = {
-   {
-   .pa_start   = 0x48078000,
-   .pa_end = 0x48078fff,
-   .flags  = ADDR_TYPE_RT
-   },
-   { }
-};
-
 /* l4_per -> elm */
 static struct omap_hwmod_ocp_if omap44xx_l4_per__elm = {
.master = &omap44xx_l4_per_hwmod,
.slave  = &omap44xx_elm_hwmod,
.clk= "l4_div_ck",
-   .addr   = omap44xx_elm_addrs,
.user   = OCP_USER_MPU | OCP_USER_SDMA,
 };
 
-- 
2.6.1



[PATCH v5 2/3] USB: serial: cp210x: Relocated private data from USB interface to port

2015-10-28 Thread Konstantin Shkolnyy
This change is preparation for implementing a cp2108 bug workaround.
The workaround requires storing some private data. Right now the data is
attached to the USB interface and allocated in the attach() callback.
The bug detection requires USB I/O which is done easier from port_probe()
callback rather than attach(). Since the USB access functions take port
as a parameter, and since the private data is used exclusively by these
functions, it can be allocated in port_probe(). Also, all cp210x devices
have exactly 1 port per USB interface, so moving private data from the USB
interface to port is trivial.

Signed-off-by: Konstantin Shkolnyy 
---
 drivers/usb/serial/cp210x.c | 43 +++
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
index 8ba1005..352fe63 100644
--- a/drivers/usb/serial/cp210x.c
+++ b/drivers/usb/serial/cp210x.c
@@ -43,8 +43,8 @@ static int cp210x_tiocmset(struct tty_struct *, unsigned int, 
unsigned int);
 static int cp210x_tiocmset_port(struct usb_serial_port *port,
unsigned int, unsigned int);
 static void cp210x_break_ctl(struct tty_struct *, int);
-static int cp210x_startup(struct usb_serial *);
-static void cp210x_release(struct usb_serial *);
+static int cp210x_port_probe(struct usb_serial_port *);
+static int cp210x_port_remove(struct usb_serial_port *);
 static void cp210x_dtr_rts(struct usb_serial_port *p, int on);
 
 static const struct usb_device_id id_table[] = {
@@ -197,7 +197,7 @@ static const struct usb_device_id id_table[] = {
 
 MODULE_DEVICE_TABLE(usb, id_table);
 
-struct cp210x_serial_private {
+struct cp210x_port_private {
__u8bInterfaceNumber;
 };
 
@@ -216,8 +216,8 @@ static struct usb_serial_driver cp210x_device = {
.set_termios= cp210x_set_termios,
.tiocmget   = cp210x_tiocmget,
.tiocmset   = cp210x_tiocmset,
-   .attach = cp210x_startup,
-   .release= cp210x_release,
+   .port_probe = cp210x_port_probe,
+   .port_remove= cp210x_port_remove,
.dtr_rts= cp210x_dtr_rts
 };
 
@@ -319,7 +319,7 @@ static int cp210x_get_config(struct usb_serial_port *port, 
u8 request,
unsigned int *data, int size)
 {
struct usb_serial *serial = port->serial;
-   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
+   struct cp210x_port_private *port_priv = usb_get_serial_port_data(port);
__le32 *buf;
int result, i, length;
 
@@ -333,7 +333,7 @@ static int cp210x_get_config(struct usb_serial_port *port, 
u8 request,
/* Issue the request, attempting to read 'size' bytes */
result = usb_control_msg(serial->dev, usb_rcvctrlpipe(serial->dev, 0),
request, REQTYPE_INTERFACE_TO_HOST, 0x,
-   spriv->bInterfaceNumber, buf, size,
+   port_priv->bInterfaceNumber, buf, size,
USB_CTRL_GET_TIMEOUT);
 
/* Convert data into an array of integers */
@@ -364,7 +364,7 @@ static int cp210x_set_config(struct usb_serial_port *port, 
u8 request,
unsigned int *data, int size)
 {
struct usb_serial *serial = port->serial;
-   struct cp210x_serial_private *spriv = usb_get_serial_data(serial);
+   struct cp210x_port_private *port_priv = usb_get_serial_port_data(port);
__le32 *buf;
int result, i, length;
 
@@ -383,13 +383,13 @@ static int cp210x_set_config(struct usb_serial_port 
*port, u8 request,
result = usb_control_msg(serial->dev,
usb_sndctrlpipe(serial->dev, 0),
request, REQTYPE_HOST_TO_INTERFACE, 0x,
-   spriv->bInterfaceNumber, buf, size,
+   port_priv->bInterfaceNumber, buf, size,
USB_CTRL_SET_TIMEOUT);
} else {
result = usb_control_msg(serial->dev,
usb_sndctrlpipe(serial->dev, 0),
request, REQTYPE_HOST_TO_INTERFACE, data[0],
-   spriv->bInterfaceNumber, NULL, 0,
+   port_priv->bInterfaceNumber, NULL, 0,
USB_CTRL_SET_TIMEOUT);
}
 
@@ -878,29 +878,32 @@ static void cp210x_break_ctl(struct tty_struct *tty, int 
break_state)
cp210x_set_config(port, CP210X_SET_BREAK, &state, 2);
 }
 
-static int cp210x_startup(struct usb_serial *serial)
+static int cp210x_port_probe(struct usb_serial_port *port)
 {
+   struct usb_serial *serial = port->serial;
struct usb_host_interface *cur_altsetting;
-   struct cp210x_serial_private *spriv;
+   struct cp210x_port_private *port_priv;
 
-   spr

Re: [PATCH v2 2/5] dax: increase granularity of dax_clear_blocks() operations

2015-10-28 Thread Jeff Moyer
"Williams, Dan J"  writes:

> The problem is that intervening call to cond_resched().  I later want to
> inject an rcu_read_lock()/unlock() pair to allow flushing active
> dax_map_atomic() usages at driver teardown time [1].  But, I think the
> patch stands alone as a cleanup outside of that admittedly hidden
> motivation.

I'm not going to split hairs.

Reviewed-by: Jeff Moyer 


[PATCH 0/2] Remove elm address space from omap4 hwmod

2015-10-28 Thread Franklin S Cooper Jr
This patchset removes elm address entry from omap4 hwmod and adds
an elm DT node to omap4.dtsi.

Since no omap4 supports nand in mainline this patchset was boot
tested on a pandaboard.

Franklin S Cooper Jr (2):
  ARM: dts: omap4: Add elm node
  ARM: omap4: hwmod: Remove elm address space from hwmod data

 arch/arm/boot/dts/omap4.dtsi   |  8 
 arch/arm/mach-omap2/omap_hwmod_44xx_data.c | 10 --
 2 files changed, 8 insertions(+), 10 deletions(-)

-- 
2.6.1



Re: [PATCH 1/5] mtd: ofpart: grab device tree node directly from master device node

2015-10-28 Thread Robert Jarzmik
Brian Norris  writes:

>> > 
>> > Do some sorts of chipselects come into play here ? Ie. you can have one 
>> > master
>> > with multiple NAND chips connected to it.
>> 
>> Most NAND controllers support interacting with several chips (or
>> dies in case your chip embeds several NAND dies), but I keep thinking
>> each physical chip should have its own instance of nand_chip + mtd_info.
>> If you want to have a single mtd device aggregating several chips you
>> can use mtdconcat.
>> 
>> This leaves the multi-dies chip case, and IHMO we should represent those
>> chips as a single entity, and I guess that's the purpose of the
>> ->numchips field in nand_chip (if your chip embeds 2 dies with 2 CS
>> lines, then ->numchips should be 2).
> Yes, I think that's some of the intention there. And so even in that
> case, a multi-die chip gets represented as a single struct nand_chip.

Isn't there the case of a single NAND controller with 2 identical chips, each a
8 bit NAND chip, and the controller aggregating them to offer the OS a single
16-bit NAND chip ?

In this case, the controller (pxa3xx is a good example) will be programmed to
handle both chips at the same time, and calculate CRC on both chips, etc ... I
hope the assertion "physical chip should have its own instance of nand_chip +
mtd_info" does take into account this example.

I don't know if there is actually any user of this for either pxa3xx or another
controller, nor if there is any value in this.

Cheers.

-- 
Robert


Re: [PATCH v2 1/5] pmem, dax: clean up clear_pmem()

2015-10-28 Thread Jeff Moyer
Dan Williams  writes:

> On Thu, Oct 22, 2015 at 1:48 PM, Jeff Moyer  wrote:
>> Dan Williams  writes:
>>
>>> Both, __dax_pmd_fault, and clear_pmem() were taking special steps to
>>> clear memory a page at a time to take advantage of non-temporal
>>> clear_page() implementations.  However, x86_64 does not use
>>> non-temporal instructions for clear_page(), and arch_clear_pmem() was
>>> always incurring the cost of __arch_wb_cache_pmem().
>>>
>>> Clean up the assumption that doing clear_pmem() a page at a time is more
>>> performant.
>>
>> Wouldn't another solution be to actually use non-temporal stores?
>
> Sure.
>
>> Why did you choose to punt?
>
> Just a priority call at this point.  Patches welcome of course ;-).

OK.  Patch is harmless.

Reviewed-by: Jeff Moyer 


Re: tools/build: fixdep versus tools/lib/bpf

2015-10-28 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 28, 2015 at 09:44:50PM +0100, Jiri Olsa escreveu:
> On Wed, Oct 28, 2015 at 05:37:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Oct 28, 2015 at 09:13:52PM +0100, Jiri Olsa escreveu:
> > > On Wed, Oct 28, 2015 at 01:30:40PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Hi Jiri, Wang,
> > > > 
> > > > I'm getting these while doing 'make -C tools/perf build-test':
> > > > 
> > > >   LD   fixdep-in.o
> > > >   LINK fixdep
> > > > /bin/sh: /home/acme/git/linux/tools/build/fixdep: Permission denied
> > > > make[6]: *** [bpf.o] Error 1
> > > > make[5]: *** [libbpf-in.o] Error 2
> > > > make[4]: *** [/home/acme/git/linux/tools/lib/bpf/libbpf.a] Error 2
> > > > make[4]: *** Waiting for unfinished jobs
> > > > 
> > > > 
> > > > It happens at different tests, i.e. seems like a race somewhere in the
> > > > build system, can you take a look? It happens with my perf/ebpf branch.
> > > 
> > > could not reproduce, but looks like attached patch should help
> > 
> > I'll test this now, i.e. make it go thru a 'make -C tools/perf
> > build-test'.
> > 
> > In the interest of speeding up things, please provide an explanation of
> > why this should be applied, so that I can add it to the changeset log.
> > 
> > Thanks a bunch!
> > 
> 
> The fixdep tool needs to be built as the first binary.
> Libraries are built in paralel, so each of them needs
> to depend on fixdep target.


I really need a faster machine; you provided the answer at this point:

- make_tags_O: cd . && make -f Makefile O=/tmp/tmp.xfRz6THR6o
  DESTDIR=/tmp/tmp.y0nuN0Fr9n tags
- make_cscope_O: cd . && make -f Makefile O=/tmp/tmp.O6phQXHU4z
  DESTDIR=/tmp/tmp.5mdMeF1pH2 cscope
- tarpkg: ./tests/perf-targz-src-pkg .
- make -C  tools/perf
- make -C /tools perf

Almost there :-)

- Arnaldo


Re: [PATCH v2 0/4] hugetlbfs fallocate hole punch race with page faults

2015-10-28 Thread Hugh Dickins
On Wed, 28 Oct 2015, Mike Kravetz wrote:
> On 10/27/2015 08:34 PM, Hugh Dickins wrote:
> 
> Thanks for the detailed response Hugh.  I will try to address your questions
> and provide more reasoning behind the use case and need for this code.

And thank you for your detailed response, Mike: that helped a lot.

> Ok, here is a bit more explanation of the proposed use case.  It all
> revolves around a DB's use of hugetlbfs and the desire for more control
> over the underlying memory.  This additional control is achieved by
> adding existing fallocate and userfaultfd semantics to hugetlbfs.
> 
> In this use case there is a single process that manages hugetlbfs files
> and the underlying memory resources.  It pre-allocates/initializes these
> files.
> 
> In addition, there are many other processes which access (rw mode) these
> files.  They will simply mmap the files.  It is expected that they will
> not fault in any new pages.  Rather, all pages would have been pre-allocated
> by the management process.
> 
> At some time, the management process determines that specific ranges of
> pages within the hugetlbfs files are no longer needed.  It will then punch
> holes in the files.  These 'free' pages within the holes may then be used
> for other purposes.  For applications like this (sophisticated DBs), huge
> pages are reserved at system init time and closely managed by the
> application.
> Hence, the desire for this additional control.
> 
> So, when a hole containing N huge pages is punched, the management process
> wants to know that it really has N huge pages for other purposes.  Ideally,
> none of the other processes mapping this file/area would access the hole.
> This is an application error, and it can be 'caught' with  userfaultfd.
> 
> Since these other (non-management) processes will never fault in pages,
> they would simply set up userfaultfd to catch any page faults immediately
> after mmaping the hugetlbfs file.
> 
> > 
> > But it sounds to me more as if the holes you want punched are not
> > quite like on other filesystems, and you want to be able to police
> > them afterwards with userfaultfd, to prevent them from being refilled.
> 
> I am not sure if they are any different.
> 
> One could argue that a hole punch operation must always result in all
> pages within the hole being deallocated.  As you point out, this could
> race with a fault.  Previously, there would be no way to determine if
> all pages had been deallocated because user space could not detect this
> race.  Now, userfaultfd allows user space to catch page faults.  So,
> it is now possible to catch/depend on hole punch deallocating all pages
> within the hole.
> 
> > 
> > Can't userfaultfd be used just slightly earlier, to prevent them from
> > being filled while doing the holepunch?  Then no need for this patchset?
> 
> I do not think so, at least with current userfaultfd semantics.  The hole
> needs to be punched before being caught with UFFDIO_REGISTER_MODE_MISSING.

Great, that makes sense.

I was worried that you needed some kind of atomic treatment of the whole
extent punched, but all you need is to close the hole/fault race one
hugepage at a time.

Throw away all of 1/4, 2/4, 3/4: I think all you need is your 4/4
(plus i_mmap_lock_write around the hugetlb_vmdelete_list of course).

There you already do the single hugepage hugetlb_vmdelete_list()
under mutex_lock(&hugetlb_fault_mutex_table[hash]).

And it should come as no surprise that hugetlb_fault() does most
of its work under that same mutex.

So once remove_inode_hugepages() unlocks the mutex, that page is gone
from the file, and userfaultfd UFFDIO_REGISTER_MODE_MISSING will do
what you want, won't it?

I don't think "my" code buys you anything at all: you're not in danger of
shmem's starvation livelock issue, partly because remove_inode_hugepages()
uses the simple loop from start to end, and partly because hugetlb_fault()
already takes the serializing mutex (no equivalent in shmem_fault()).

Or am I dreaming?

Hugh


[tip:core/rcu] fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()

2015-10-28 Thread tip-bot for Tejun Heo
Commit-ID:  b33e18f61bd18227a456016a77b1a968f5bc1d65
Gitweb: http://git.kernel.org/tip/b33e18f61bd18227a456016a77b1a968f5bc1d65
Author: Tejun Heo 
AuthorDate: Tue, 27 Oct 2015 14:19:39 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 28 Oct 2015 13:17:30 +0100

fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in 
bdi_split_work_to_wbs()

bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding a non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar 
Signed-off-by: Tejun Heo 
Cc: Darren Hart 
Cc: David Howells 
Cc: Dipankar Sarma 
Cc: Eric Dumazet 
Cc: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Linus Torvalds 
Cc: Mathieu Desnoyers 
Cc: Oleg Nesterov 
Cc: Patrick Marlier 
Cc: Paul McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: pranith kumar 
Link: http://lkml.kernel.org/r/20151027051939.ga19...@mtj.duckdns.org
Signed-off-by: Ingo Molnar 
---
 fs/fs-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 29e4599..7378169 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -779,8 +779,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info 
*bdi,
  bool skip_if_busy)
 {
struct bdi_writeback *last_wb = NULL;
-   struct bdi_writeback *wb = list_entry_rcu(&bdi->wb_list,
-   struct bdi_writeback, bdi_node);
+   struct bdi_writeback *wb = list_entry(&bdi->wb_list,
+ struct bdi_writeback, bdi_node);
 
might_sleep();
 restart:


hw_breakpoint.c -- a bad idea

2015-10-28 Thread Jeffrey Merkey
Allowing other subsystems to clobber the DR7 register is a bad idea
when debuggers are loaded.  I realize that this piece of code was
written for perf monitors, and poor KGDB and KDB have to use this
interface to set and clear hardware breakpoints.  Problem with this
architecture is that is you are using complex chaining of conditional
breakpoints in a debugger monitoring cross processor events, this code
has no way of knowing the state of subsystems using the breakpoints
while it blindly clears the DR7 register during a breakpoint
exception.  It either needs to be disabled, or a subsystem debugger
needs the ability to flag it as "owned" for a specific period of
time.

I realize that the way its coded is designed to share the DR
registers, but if a debugger is active, this code is completely
disabled anyway, so having it is pointless except for the perf
functions when no debugger is active.  I am debating adapting MDB to
use this interface rather than manage breakpoint registers itself, but
I don't see the point in dancing around this code and trying to stand
on my head and implement something that removes features and
capability from the debugger.  MDB supports complex SMP conditional
breakpoints, unlike KDB or KGDB.

This interface should be restructured to allow it to be completely
disabled if necessary by the debugger or owned and a snipet of code in
an event handler should not be blindly clobbering DR7 without knowing
the state of debuggers above it.  I can move the functionality into
this section, but it kind of defeats the purpose of having a self
contained debugger.  A debugger should have minimal dependencies on OS
code and be as much a self contained bubble as possible.

At any rate, I disable it when MDB is loaded.  I will investigate a
better way to do this to allow the perf code to run as well.

Jeff
static int hw_breakpoint_handler(struct die_args *args)
{
int i, cpu, rc = NOTIFY_STOP;
struct perf_event *bp;
unsigned long dr7, dr6;
unsigned long *dr6_p;

/* The DR6 value is pointed by args->err */
dr6_p = (unsigned long *)ERR_PTR(args->err);
dr6 = *dr6_p;

/* If it's a single step, TRAP bits are random */
if (dr6 & DR_STEP)
return NOTIFY_DONE;

/* Do an early return if no trap bits are set in DR6 */
if ((dr6 & DR_TRAP_BITS) == 0)
return NOTIFY_DONE;

// if MDB is loaded turn off
#if defined(CONFIG_MDB) || defined(CONFIG_MDB_MODULE)
if (disable_hw_bp_interface)
return NOTIFY_DONE;
#endif

get_debugreg(dr7, 7);
/* Disable breakpoints during exception handling */

set_debugreg(0UL, 7);   // <--- broken: breaks proceed, conditional, and temp breakpoints in MDB

/*
 * Assert that local interrupts are disabled
 * Reset the DRn bits in the virtualized register value.
 * The ptrace trigger routine will add in whatever is needed.
 */


Re: [PATCH v2 5/5] block: enable dax for raw block devices

2015-10-28 Thread Jeff Moyer
Dan Williams  writes:

> If an application wants exclusive access to all of the persistent memory
> provided by an NVDIMM namespace it can use this raw-block-dax facility
> to forgo establishing a filesystem.  This capability is targeted
> primarily to hypervisors wanting to provision persistent memory for
> guests.

OK, I'm going to expose my ignorance here.  :)  Why does the block device
need a page_mkwrite handler?

-Jeff

> Cc: Jeff Moyer 
> Cc: Christoph Hellwig 
> Cc: Dave Chinner 
> Cc: Andrew Morton 
> Cc: Ross Zwisler 
> Reviewed-by: Jan Kara 
> Signed-off-by: Dan Williams 
> ---
>  fs/block_dev.c |   60 
> +++-
>  1 file changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c1f691859a56..210d05103657 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1687,13 +1687,71 @@ static const struct address_space_operations 
> def_blk_aops = {
>   .is_dirty_writeback = buffer_check_dirty_writeback,
>  };
>  
> +#ifdef CONFIG_FS_DAX
> +/*
> + * In the raw block case we do not need to contend with truncation nor
> + * unwritten file extents.  Without those concerns there is no need for
> + * additional locking beyond the mmap_sem context that these routines
> + * are already executing under.
> + *
> + * Note, there is no protection if the block device is dynamically
> + * resized (partition grow/shrink) during a fault. A stable block device
> + * size is already not enforced in the blkdev_direct_IO path.
> + *
> + * For DAX, it is the responsibility of the block device driver to
> + * ensure the whole-disk device size is stable while requests are in
> + * flight.
> + *
> + * Finally, in contrast to the generic_file_mmap() path, there are no
> + * calls to sb_start_pagefault().  That is meant to synchronize write
> + * faults against requests to freeze the contents of the filesystem
> + * hosting vma->vm_file.  However, in the case of a block device special
> + * file, it is a 0-sized device node usually hosted on devtmpfs, i.e.
> + * nothing to do with the super_block for bdev_file_inode(vma->vm_file).
> + * We could call get_super() in this path to retrieve the right
> + * super_block, but the generic_file_mmap() path does not do this for
> + * the CONFIG_FS_DAX=n case.
> + */
> +static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return __dax_fault(vma, vmf, blkdev_get_block, NULL);
> +}
> +
> +static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long 
> addr,
> + pmd_t *pmd, unsigned int flags)
> +{
> + return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
> +}
> +
> +static const struct vm_operations_struct blkdev_dax_vm_ops = {
> + .page_mkwrite   = blkdev_dax_fault,
> + .fault  = blkdev_dax_fault,
> + .pmd_fault  = blkdev_dax_pmd_fault,
> +};
> +
> +static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct inode *bd_inode = bdev_file_inode(file);
> +
> + if (!IS_DAX(bd_inode))
> + return generic_file_mmap(file, vma);
> +
> + file_accessed(file);
> + vma->vm_ops = &blkdev_dax_vm_ops;
> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> + return 0;
> +}
> +#else
> +#define blkdev_mmap generic_file_mmap
> +#endif
> +
>  const struct file_operations def_blk_fops = {
>   .open   = blkdev_open,
>   .release= blkdev_close,
>   .llseek = block_llseek,
>   .read_iter  = blkdev_read_iter,
>   .write_iter = blkdev_write_iter,
> - .mmap   = generic_file_mmap,
> + .mmap   = blkdev_mmap,
>   .fsync  = blkdev_fsync,
>   .unlocked_ioctl = block_ioctl,
>  #ifdef CONFIG_COMPAT


Re: Triggering non-integrity writeback from userspace

2015-10-28 Thread Dave Chinner
Hi Andres,

On Wed, Oct 28, 2015 at 10:27:52AM +0100, Andres Freund wrote:
> On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:

> > Data integrity operations require related file metadata (e.g. block
> > allocation trnascations) to be forced to the journal/disk, and a
> > device cache flush issued to ensure the data is on stable storage.
> > SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> > the IO might be the same pattern as a data integrity operation, it
> > does not provide such guarantees.
> 
> Which is desired here - the actual integrity is still going to be done
> via fsync().

OK, so you require data integrity, but

> The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
> the fsync() will only have to do very little work. The language in
> sync_file_range(2) doesn't inspire enough confidence for using it as an
> actual integrity operation :/

So really you're trying to minimise the blocking/latency of fsync()?

> > You don't want to do writeback from the syscall, right? i.e. you'd
> > like to expire the inode behind the fd, and schedule background
> > writeback to run on it immediately?
> 
> Yes, that's exactly what we want. Blocking if a process has done too
> much writes is fine tho.

OK, so it's really the latency of the fsync() operation that is what
you are trying to avoid? I've been meaning to get back to a generic
implementation of an aio fsync operation:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

Would that be a better approach to solving your need for a
non-blocking data integrity flush of a file?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

