Re: [PATCH 1/1] ARM: keystone: defconfig: Fix USB configuration

2016-08-17 Thread Santosh Shilimkar

Hi Arnd, Olof,

Can you please pick up the fix for 4.8-rcx?
Roger reported that USB ports are broken on Keystone2 boards
since v4.8-rc1 because the USB PHY config option got dropped.


On 8/17/2016 3:44 AM, Roger Quadros wrote:

Simply enabling CONFIG_KEYSTONE_USB_PHY doesn't work anymore
as it depends on CONFIG_NOP_USB_XCEIV. We need to enable
that as well.

This fixes USB on Keystone boards from v4.8-rc1 onwards.

Signed-off-by: Roger Quadros 
---

Acked-by: Santosh Shilimkar 



 arch/arm/configs/keystone_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/configs/keystone_defconfig 
b/arch/arm/configs/keystone_defconfig
index 71b42e6..78cd2f1 100644
--- a/arch/arm/configs/keystone_defconfig
+++ b/arch/arm/configs/keystone_defconfig
@@ -161,6 +161,7 @@ CONFIG_USB_MON=y
 CONFIG_USB_XHCI_HCD=y
 CONFIG_USB_STORAGE=y
 CONFIG_USB_DWC3=y
+CONFIG_NOP_USB_XCEIV=y
 CONFIG_KEYSTONE_USB_PHY=y
 CONFIG_NEW_LEDS=y
 CONFIG_LEDS_CLASS=y



Re: [PATCH 2/2] remoteproc: core: Rework obtaining a rproc from a DT phandle

2016-08-10 Thread Santosh Shilimkar

+Suman,

On 8/10/2016 10:15 AM, Bjorn Andersson wrote:

On Tue 19 Jul 08:49 PDT 2016, Lee Jones wrote:


In this patch we:
 - Use a subsystem generic phandle to obtain an rproc
   - We have to support TI's bespoke version for the time being
 - Convert wkup_m3_ipc driver to new API
 - Rename the call to be more like other, similar OF calls
 - Move feature-not-enabled inline stub to the headers
 - Strip out duplicate code by calling into of_get_rproc_by_index()

Signed-off-by: Lee Jones 
---
 drivers/remoteproc/remoteproc_core.c | 41 
 drivers/soc/ti/wkup_m3_ipc.c | 14 +++-
 include/linux/remoteproc.h   |  4 ++--
 3 files changed, 14 insertions(+), 45 deletions(-)


[..]

diff --git a/drivers/soc/ti/wkup_m3_ipc.c b/drivers/soc/ti/wkup_m3_ipc.c
index 8823cc8..15481f3 100644
--- a/drivers/soc/ti/wkup_m3_ipc.c
+++ b/drivers/soc/ti/wkup_m3_ipc.c
@@ -385,7 +385,6 @@ static int wkup_m3_ipc_probe(struct platform_device *pdev)
 {
struct device *dev = &pdev->dev;
int irq, ret;
-   phandle rproc_phandle;
struct rproc *m3_rproc;
struct resource *res;
struct task_struct *task;
@@ -430,16 +429,9 @@ static int wkup_m3_ipc_probe(struct platform_device *pdev)
return PTR_ERR(m3_ipc->mbox);
}

-   if (of_property_read_u32(dev->of_node, "ti,rproc", &rproc_phandle)) {
-   dev_err(&pdev->dev, "could not get rproc phandle\n");
-   ret = -ENODEV;
-   goto err_free_mbox;
-   }
-
-   m3_rproc = rproc_get_by_phandle(rproc_phandle);
-   if (!m3_rproc) {
-   dev_err(&pdev->dev, "could not get rproc handle\n");
-   ret = -EPROBE_DEFER;
+   m3_rproc = of_get_rproc_by_phandle(dev->of_node);
+   if (IS_ERR(m3_rproc)) {
+   ret = PTR_ERR(m3_rproc);
goto err_free_mbox;
}



Santosh, do you have any objections to me merging this?


This looks OK to me, but I have not been merging the remote
proc code. Looping in Suman who, IIRC, was looking at it along
with Ohad.


of_get_rproc_by_phandle() will fall back and attempt to acquire the
handle from ti,rproc if the generic "rprocs" property doesn't exist.
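
For readers skimming the thread, a minimal sketch of the fallback Lee
describes (the helper name of_get_rproc_sketch() is invented and the exact
property strings are assumptions taken from this discussion, not the code
from the patch):

#include <linux/err.h>
#include <linux/errno.h>
#include <linux/of.h>
#include <linux/remoteproc.h>

static struct rproc *of_get_rproc_sketch(struct device_node *np)
{
	struct device_node *rp_np;
	struct rproc *rproc;

	/* Try the subsystem-generic phandle first ... */
	rp_np = of_parse_phandle(np, "rprocs", 0);
	if (!rp_np)	/* ... then fall back to TI's bespoke property */
		rp_np = of_parse_phandle(np, "ti,rproc", 0);
	if (!rp_np)
		return ERR_PTR(-ENODEV);

	/* rproc_get_by_phandle() takes the raw phandle value */
	rproc = rproc_get_by_phandle(rp_np->phandle);
	of_node_put(rp_np);

	return rproc ? rproc : ERR_PTR(-EPROBE_DEFER);
}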



Suman,
Can you please check this series and see if you can line it up?
I am not sure if Ohad is still maintaining it.

Regards,
Santosh


Re: [PATCH v2 0/3] Add DSP control nodes for K2G

2016-08-09 Thread Santosh Shilimkar

On 8/9/2016 7:33 AM, Andrew F. Davis wrote:

Hello all,

This series adds the nodes needed to control the DSP available on
this SoC. These are similar to the nodes already present on the
other K2x SoCs.

Thanks,
Andrew

Andrew F. Davis (3):
  ARM: dts: keystone-k2g: Add device state controller node
  ARM: dts: keystone-k2g: Add keystone IRQ controller node
  ARM: dts: keystone-k2g: Add DSP GPIO controller node


The series looks good to me. I will add them to the next merge
window queue...


Re: [PATCH] device probe: add self triggered delayed work request

2016-08-08 Thread Santosh Shilimkar



On 8/8/2016 6:11 PM, Frank Rowand wrote:

On 08/08/16 14:51, Qing Huang wrote:



On 08/08/2016 01:44 PM, Frank Rowand wrote:

On 07/29/16 22:39, Qing Huang wrote:

Under normal conditions, the device probe requests kept in the deferred
queue are only triggered for re-probing when another new device
probe finishes successfully. This change will set up a delayed
trigger work request if the current deferred probe being added is
the only one in the queue. This delayed work request will try to
reactivate any device from the deferred queue for re-probing later.

By doing this, if the last device being probed in the system boot process
has a deferred probe error, this particular device will still be able
to be probed again.
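
To make the mechanism above concrete, a rough editorial sketch (not the
submitted patch; the work-item name and the 10 second delay are invented,
and locking is omitted) of what "set up a delayed trigger when the queue
was empty" could look like inside drivers/base/dd.c, which already has
driver_deferred_probe_trigger() and deferred_probe_pending_list as
file-local symbols:

#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

static void deferred_retrigger_fn(struct work_struct *work)
{
	/* re-probe every device currently sitting on the deferred list */
	driver_deferred_probe_trigger();
}

static DECLARE_DELAYED_WORK(deferred_retrigger_work, deferred_retrigger_fn);

static void driver_deferred_probe_add_sketch(struct device *dev)
{
	bool was_empty = list_empty(&deferred_probe_pending_list);

	list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);

	/* arm the self-trigger only when this is the sole deferred device */
	if (was_empty)
		schedule_delayed_work(&deferred_retrigger_work, 10 * HZ);
}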

I am trying to understand the use case.

Can you explain the scenario you are trying to fix?  If I understand
correctly, you expect that something will change such that a later
probe attempt will succeed.  How will that change occur and why
will the deferred probe list not be processed in this case?

Why are you conditioning this on the deferred_probe_pending_list
being empty?

-Frank


It turns out the corner case which we worried about has already been
solved in the really_probe() function by comparing
'deferred_trigger_count' values.

Another use case we are investigating now: when we probe a device,
the main thread returns EPROBE_DEFER from the driver after we spawn a
child thread to do the actual init work. So we can initialize
multiple similar devices at the same time. After the child thread
finishes its task, we can call driver_deferred_probe_trigger()
directly from the child thread to re-probe the
device (driver_deferred_probe_trigger() has to be exported, though). Or
we could rely on something in this patch to re-probe the deferred
devices from the pending list...
What do you suggest?
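
As a concrete (and heavily simplified) editorial illustration of the use
case Qing describes - all names below are invented, and error handling is
omitted: probe() kicks off the slow init in a kthread and defers, and the
open question is how the finished kthread gets the core to re-probe.

#include <linux/errno.h>
#include <linux/kthread.h>
#include <linux/platform_device.h>

static struct {
	bool init_started;
	bool hw_ready;
} foo_state;

static int foo_slow_init(void *data)
{
	/* ... long-running hardware initialization ... */
	foo_state.hw_ready = true;
	/* nothing standard to call here today to force a re-probe */
	return 0;
}

static int foo_probe(struct platform_device *pdev)
{
	if (foo_state.hw_ready)
		return 0;	/* init finished during an earlier deferral */

	if (!foo_state.init_started) {
		foo_state.init_started = true;
		kthread_run(foo_slow_init, NULL, "foo-init");
	}

	/* come back later; something must re-trigger deferred probing */
	return -EPROBE_DEFER;
}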


See commit 735a7ffb739b6efeaeb1e720306ba308eaaeb20e for how multi-threaded
probes were intended to be handled.  I don't know if this approach is used
much or even usable, but that is the framework that was created.


That infrastructure got removed as part of the commit below :-(

commit 5adc55da4a7758021bcc374904b0f8b076508a11
Author: Adrian Bunk 
Date:   Tue Mar 27 03:02:51 2007 +0200

PCI: remove the broken PCI_MULTITHREAD_PROBE option

This patch removes the PCI_MULTITHREAD_PROBE option that had already
been marked as broken.

Signed-off-by: Adrian Bunk 
Signed-off-by: Greg Kroah-Hartman 




Re: [RFC PATCH] softirq: fix tasklet_kill() usage and users

2016-08-06 Thread Santosh Shilimkar

ping !!

On 8/1/2016 9:13 PM, Santosh Shilimkar wrote:

Semantically the expectation from the tasklet init/kill API
should be as below.

tasklet_init() == Init and Enable scheduling
tasklet_kill() == Disable scheduling and Destroy

The tasklet_init() API exhibits the above behavior, but
tasklet_kill() does not. The tasklet handler can still get scheduled
and run even after tasklet_kill().
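
For illustration only (not part of the patch), the lifecycle being
described looks like this in driver terms; the workaround that some
drivers carry today, and that this series removes, is shown in the
teardown comment:

#include <linux/interrupt.h>

static void demo_handler(unsigned long data)
{
	/* bottom-half work */
}

static struct tasklet_struct demo_tasklet;

static void demo_setup(void)
{
	tasklet_init(&demo_tasklet, demo_handler, 0);	/* init + enable scheduling */
	tasklet_schedule(&demo_tasklet);		/* handler may run from now on */
}

static void demo_teardown(void)
{
	/*
	 * Expected semantics: once this returns, demo_handler() must never
	 * run again.  The workaround pattern being removed was:
	 *	tasklet_disable(&demo_tasklet);
	 *	tasklet_kill(&demo_tasklet);
	 */
	tasklet_kill(&demo_tasklet);
}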

There are a few places where drivers are working around
this issue by calling tasklet_disable(), which adds a
usecount and thereby avoids the handler being called.
One example is commit 1e1257860fd1
("tty/serial: at91: correct the usage of tasklet").

tasklet_enable()/tasklet_disable() is a paired API and the two are
expected to be used together. Using tasklet_disable() *just* to
work around tasklet scheduling after a kill is probably not the
correct and intended use of the API.
We also happened to see a similar issue where, in the shutdown path,
the tasklet handler was getting called even after
tasklet_kill().

We can fix this by making sure tasklet_kill() does the right
thing, thereby ensuring the tasklet handler won't run after
tasklet_kill(), with a very simple change. The patch fixes the tasklet
code and also removes the few driver hacks that work around the issue.

Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Tadeusz Struk 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Paul Bolle 
Cc: Nicolas Ferre 

Signed-off-by: Santosh Shilimkar 
---
 drivers/crypto/qat/qat_common/adf_isr.c| 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c   | 1 -
 drivers/tty/serial/atmel_serial.c  | 1 -
 kernel/softirq.c   | 7 ---
 6 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c 
b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
int i;

for (i = 0; i < hw_data->num_banks; i++) {
-   tasklet_disable(&priv_data->banks[i].resp_handler);
tasklet_kill(&priv_data->banks[i].resp_handler);
}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c 
b/drivers/crypto/qat/qat_common/adf_sriov.c
index 4a526e2..9e65888 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
}

for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-   tasklet_disable(&vf->vf2pf_bh_tasklet);
tasklet_kill(&vf->vf2pf_bh_tasklet);
mutex_destroy(&vf->pf2vf_lock);
}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c 
b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index aa689ca..81e63bf 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev 
*accel_dev)

 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-   tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
struct adf_etr_data *priv_data = accel_dev->transport;

-   tasklet_disable(&priv_data->banks[0].resp_handler);
tasklet_kill(&priv_data->banks[0].resp_handler);
 }

diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
if (!drv->have_tty)
return;

-   tasklet_disable(&cs->if_wake_tasklet);
tasklet_kill(&cs->if_wake_tasklet);
cs->tty_dev = NULL;
tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/drivers/tty/serial/atmel_serial.c 
b/drivers/tty/serial/atmel_serial.c
index 954941d..27e638e 100644
--- a/drivers/tty/serial/atmel_serial.c
+++ b/drivers/tty/serial/atmel_serial.c
@@ -1915,7 +1915,6 @@ static void atmel_shutdown(struct uart_port *port)
 * Clear out any scheduled tasklets before
 * we destroy the buffers
 */
-   tasklet_disable(&atmel_port->tasklet);
tasklet_kill(&atmel_port->tasklet);

/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)

Re: [PATCH 1/1] RDS: add __printf format attribute to error reporting functions

2016-08-05 Thread Santosh Shilimkar

On 8/5/2016 1:11 PM, Nicolas Iooss wrote:

This is helpful to detect at compile-time errors related to format
strings.

Signed-off-by: Nicolas Iooss 
---

OK.
Acked-by: Santosh Shilimkar 
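
For context, an editorial illustration of the kind of annotation the patch
adds (the function below is invented; the real hunks touch the RDS error
reporting helpers): __printf(fmt_idx, args_idx) lets gcc type-check the
format string against the arguments at compile time.

#include <linux/errno.h>
#include <linux/kernel.h>

__printf(2, 3)
static void rds_demo_error(int err, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	pr_err("RDS demo error %d: %pV", err, &vaf);
	va_end(args);
}

static void demo(void)
{
	/* With the annotation, gcc now warns: "%d" expects int, got char * */
	rds_demo_error(-EINVAL, "bad value %d\n", "oops");
}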


[RFC PATCH] softirq: fix tasklet_kill() usage and users

2016-08-01 Thread Santosh Shilimkar
Semantically the expectation from the tasklet init/kill API
should be as below.

tasklet_init() == Init and Enable scheduling
tasklet_kill() == Disable scheduling and Destroy

The tasklet_init() API exhibits the above behavior, but
tasklet_kill() does not. The tasklet handler can still get scheduled
and run even after tasklet_kill().

There are a few places where drivers are working around
this issue by calling tasklet_disable(), which adds a
usecount and thereby avoids the handler being called.
One example is commit 1e1257860fd1
("tty/serial: at91: correct the usage of tasklet").

tasklet_enable()/tasklet_disable() is a paired API and the two are
expected to be used together. Using tasklet_disable() *just* to
work around tasklet scheduling after a kill is probably not the
correct and intended use of the API.
We also happened to see a similar issue where, in the shutdown path,
the tasklet handler was getting called even after
tasklet_kill().

We can fix this by making sure tasklet_kill() does the right
thing, thereby ensuring the tasklet handler won't run after
tasklet_kill(), with a very simple change. The patch fixes the tasklet
code and also removes the few driver hacks that work around the issue.

Cc: Greg Kroah-Hartman 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Tadeusz Struk 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Paul Bolle 
Cc: Nicolas Ferre 

Signed-off-by: Santosh Shilimkar 
---
 drivers/crypto/qat/qat_common/adf_isr.c| 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c   | 1 -
 drivers/tty/serial/atmel_serial.c  | 1 -
 kernel/softirq.c   | 7 ---
 6 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c 
b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
int i;
 
for (i = 0; i < hw_data->num_banks; i++) {
-   tasklet_disable(&priv_data->banks[i].resp_handler);
tasklet_kill(&priv_data->banks[i].resp_handler);
}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c 
b/drivers/crypto/qat/qat_common/adf_sriov.c
index 4a526e2..9e65888 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
}
 
for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-   tasklet_disable(&vf->vf2pf_bh_tasklet);
tasklet_kill(&vf->vf2pf_bh_tasklet);
mutex_destroy(&vf->pf2vf_lock);
}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c 
b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index aa689ca..81e63bf 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev 
*accel_dev)
 
 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-   tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
struct adf_etr_data *priv_data = accel_dev->transport;
 
-   tasklet_disable(&priv_data->banks[0].resp_handler);
tasklet_kill(&priv_data->banks[0].resp_handler);
 }
 
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
if (!drv->have_tty)
return;
 
-   tasklet_disable(&cs->if_wake_tasklet);
tasklet_kill(&cs->if_wake_tasklet);
cs->tty_dev = NULL;
tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/drivers/tty/serial/atmel_serial.c 
b/drivers/tty/serial/atmel_serial.c
index 954941d..27e638e 100644
--- a/drivers/tty/serial/atmel_serial.c
+++ b/drivers/tty/serial/atmel_serial.c
@@ -1915,7 +1915,6 @@ static void atmel_shutdown(struct uart_port *port)
 * Clear out any scheduled tasklets before
 * we destroy the buffers
 */
-   tasklet_disable(&atmel_port->tasklet);
tasklet_kill(&atmel_port->tasklet);
 
/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
list = list-&

Re: [PATCH V2 45/63] clocksource/drivers/timer-keystone: Convert init function to return error

2016-06-17 Thread Santosh Shilimkar

On 6/16/2016 2:27 PM, Daniel Lezcano wrote:

The init functions do not return any error. They behave as the following:

  - panic, thus leading to a kernel crash while another timer may work and
   make the system boot up correctly

  or

  - print an error and leave the caller unaware of the state of the system

Change that by converting the init functions to return an error conforming
to the CLOCKSOURCE_OF_RET prototype.

Proper error handling (rollback, errno value) will be changed later case
by case; thus this change just returns an error or success from the init
function.

Signed-off-by: Daniel Lezcano 
---
 drivers/clocksource/timer-keystone.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)


Acked-by: Santosh Shilimkar 
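
An editorial sketch of the shape of the conversion (not the actual
timer-keystone hunk): the OF init callback now returns an int so failures
propagate to the caller instead of panicking or being silently ignored.
The compatible string and the registration macro below are best-effort
assumptions for the kernel version in question.

#include <linux/clocksource.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of.h>
#include <linux/of_address.h>

static int __init keystone_timer_init_sketch(struct device_node *np)
{
	void __iomem *base = of_iomap(np, 0);

	if (!base) {
		pr_err("%s: failed to map timer registers\n", np->name);
		return -ENXIO;
	}

	/* ... clocksource/clockevent registration would go here ... */
	return 0;
}
/* registration macro name/prototype varies by kernel version */
CLOCKSOURCE_OF_DECLARE(keystone_sketch, "ti,keystone-timer",
		       keystone_timer_init_sketch);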


Re: [PATCH] ARM: keystone: remove redundant depends on ARM_PATCH_PHYS_VIRT

2016-06-14 Thread Santosh Shilimkar

On 6/13/2016 11:17 PM, Masahiro Yamada wrote:

Hi Santosh

Ping again.

It is taking so long
for this apparently correct patch to get applied.


I thought it was already picked up. Will apply it
for next merge window.

Regards,
Santosh


Re: [PATCH v2 0/3] ARM: Keystone: Add pinmuxing support

2016-06-09 Thread Santosh Shilimkar

On 6/9/2016 8:26 AM, Franklin S Cooper Jr wrote:

Unlike most Keystone 2 devices, K2G supports pinmuxing of its pins. This
patch series enables pinmuxing for Keystone 2 devices.

Version 2 changes:
Rebased on top of linux-next which includes Keerthy patches.


Series applied. Should start showing up in linux-next soon.

Regards,
Santosh


Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config

2016-06-08 Thread Santosh Shilimkar

Franklin,

On 6/6/2016 9:00 AM, Santosh Shilimkar wrote:

On 6/5/2016 9:56 PM, Keerthy wrote:



[...]


Santosh,

I posted a consolidated series for k2l.


Thanks. Will pick that up.


Franklin,

Could you re-post the k2g series on top of the series I posted today.


I have updated the keystone 4.8 branches and the linux-next branch.

Please refresh your DTS patches against [1] and post the same.




Regards,
Santosh

[1] 
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git 
for_4.8/keystone_dts


Re: [PATCH] RDS: IB: Remove deprecated create_workqueue

2016-06-07 Thread Santosh Shilimkar

Hi,

On 6/7/2016 12:33 PM, Bhaktipriya Shridhar wrote:

alloc_workqueue replaces deprecated create_workqueue().

Since the driver is InfiniBand, which can be used as a block device, and the
workqueue seems involved in regular operation of the device, a
dedicated workqueue has been used with WQ_MEM_RECLAIM set to guarantee
forward progress under memory pressure.



Since there are only a fixed number of work items, an explicit concurrency
limit is unnecessary here.

Signed-off-by: Bhaktipriya Shridhar 
---

Looks fine.
Acked-by: Santosh Shilimkar 
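
An editorial sketch of the conversion described in the changelog (the
workqueue name here is an assumption, not necessarily the one used in the
patch): a dedicated workqueue with WQ_MEM_RECLAIM so it keeps making
progress under memory pressure, and max_active = 0 to keep the default
concurrency limit, which is fine for a small, fixed number of work items.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *rds_ib_wq;

static int __init rds_demo_wq_init(void)
{
	rds_ib_wq = alloc_workqueue("rds_ib_wq", WQ_MEM_RECLAIM, 0);

	return rds_ib_wq ? 0 : -ENOMEM;
}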


Re: [RFC v2 4/4] ARM: keystone: dma-coherent with safe fallback

2016-06-06 Thread Santosh Shilimkar

(Joining discussion late since only this thread showed up in my
inbox)

On 6/6/2016 5:32 AM, Russell King - ARM Linux wrote:

On Mon, Jun 06, 2016 at 12:59:18PM +0100, Mark Rutland wrote:

I agree that whether or not devices are coherent in practice depends on
the kernel's configuration. The flip side, as you point out, is that
devices are coherent when a specific set of attributes are used.

i.e. that if you read dma-coherent as meaning "coherent iff Normal,
Inner Shareable, Inner WB Cacheable, Outer WB Cacheable", then
dma-coherent consistently describes the same thing, rather than
depending on the configuration of the OS.


I think there is a bit of misunderstanding with the 'dma-coherent'
DT property, and as RMK pointed out, "dma-outer-coherent" isn't
the right direction either.


DT is a datastructure provided to the kernel, potentially without deep
internal knowledge of that kernel configuration. Having a consistent
rule that is independent of the kernel configuration seems worth aiming
for.


I think you've missed the point.  dma-coherent is _already_ dependent on
the kernel configuration.  "Having a consistent rule that is independent
of the kernel configuration" is already an impossibility, as I illustrated
in my previous message concerning Marvell Armada SoCs, and you also said
in your preceding paragraph!

For example, if you clear the shared bit in the page tables on non-LPAE
SoCs, devices are no longer coherent.

DMA coherence on ARM _is_ already tightly linked with the kernel
configuration.  You already can't get away from that, so I think you
should give up trying to argue that point. :)

Whether devices are DMA coherent is a combination of two things:
 * is the device connected to a coherent bus.
 * is the system setup to allow coherency on that bus to work.

We capture the first through the dma-coherent property, which is clearly
a per-device property.  We ignore the second because we assume everyone
is going to configure the CPU side correctly.  That's untrue today, and
it's untrue not only because of Keystone II, but also because of other
SoCs as well which pre-date Keystone II.  We currently miss out on
considering that, because if we ignore it, we get something that works
for most platforms.


I agree with Russell. When I added the per-device "dma-coherent" DT
property, the intention was to distinguish certain devices which may
not be coherent, despite sitting on a coherent fabric, for some hardware reasons.


I don't see that adding a dma-outer-coherent property helps this - it's
muddying the waters somewhat - and it's also forcing additional complexity
into places where we shouldn't have it.  We would need to parse two
properties in the DMA API code, and then combine it with knowledge as
to how the system page tables have been setup.  If they've been setup
as inner sharable, then dma-coherent identifies whether the device is
coherent.  If they've been setup as outer sharable, then
dma-outer-coherent specifies that and dma-coherent is meaningless.

Sounds like a recipe for confusion.


Exactly. We should leave the "dma-coherent" property to mark coherent
vs non-coherent device(s).

The inner vs outer shareability is really a page-table/arch setup issue and should
be handled exactly the way it was handled in the first place for the
special memory view (outside 4 GB).

Keystone needs the outer shareable bit set while setting up the MMU pages,
which is best done with the MMU off while recreating the new
page tables.

Regards,
Santosh


Re: [RFC v2 4/4] ARM: keystone: dma-coherent with safe fallback

2016-06-06 Thread Santosh Shilimkar

On 6/6/2016 5:50 AM, William Mills wrote:




I saw only v2, but it seems like it has already generated
discussion(s)


On 06/06/2016 07:42 AM, Mark Rutland wrote:

On Mon, Jun 06, 2016 at 11:09:07AM +0200, Arnd Bergmann wrote:

On Monday, June 6, 2016 9:56:27 AM CEST Mark Rutland wrote:

[adding devicetree]

On Sun, Jun 05, 2016 at 11:20:29PM -0400, Bill Mills wrote:

Keystone2 can do DMA coherency but only if:
1) DDR3A DMA buffers are in high physical addresses (0x8__)
(DDR3B does not have this constraint)
2) Memory is marked outer shared
3) DMA Master marks transactions as outer shared
(This is taken care of in bootloader)

Use outer shared instead of inner shared.
This choice is done at early init time and uses the attr_mod facility

If the kernel is not configured for LPAE and using high PA, or if the
switch to outer shared fails, then we fail to meet this criteria.
Under any of these conditions we veto any dma-coherent attributes in
the DTB.


I very much do not like this. As I previously mentioned [1],
dma-coherent has de-facto semantics today. This series deliberately
changes that, and inverts the relationship between DT and kernel (as the
describption in the DT would now depend on the configuration of the
kernel).

I would prefer that we have a separate property (e.g.
"dma-outer-coherent") to describe when a device can be coherent with
Normal, Outer Shareable, Inner Write-Back, Outer Write-Back memory.
Then the kernel can figure out whether or not device can be used
coherently, depending on how it is configured.


I share your concern, but I don't think the dma-outer-coherent attribute
would be a good solution either.

The problem really is that keystone is a platform that is sometimes
coherent, depending purely on what kernel we run, and not at all on
anything we can describe in devicetree, and I don't see any good way
to capture the behavior of the hardware in generic DT bindings.


I think that above doesn't quite capture the situation:

Some DMA masters can be cache-coherent (only) with Outer Shareable
transactions. That is a property we could capture inthe DT (e.g.
dma-outer-coherent), and is independent of the kernel configuration.

Whether or not the devices are coherent with the kernel's chosen memory
attributes certainly depends on the kernel configuration, but that is
not what we capture in the DT.


So far, the assumption has been:

- when running a non-LPAE kernel, keystone is not coherent, and we
  must ignore both the dma-coherent properties in devices and the
  dma-ranges properties in bus nodes.


Correct.


I wasn't able to spot if/where that was enforced. Is it possible to boot
Keystone UP, !LPAE?



Yes ...  with the right combination of DTB, u-boot, u-boot vars, and
kernel config.  Mismatches either fail hard or use dma-coherent ops
without actually providing coherency. I am attempting to make this less
fragile.

Mis-configured coherency can be dead-wrong and still only fail 1
transaction in 1,000,000.  I have seen customers run for weeks or months
w/o detecting the issue.  That's why I wanted the veto logic.

There are 3 cases to cover:
LPAE w/ high PA:
this is the normal mode for KS2.  Uses coherent dma-ops.
!LPAE:
obviously uses low PA and must use non-coherent dma-ops.
LPAE w/ low PA:
This happens with an LPAE kernel but the user has passed a low
PA memory DTB and u-boot has not fixed it up.
This case must also use non-coherent dma-ops

Upstream DTS has keystone memory at the low PA.  I agree with that.
U-boot and kernel opt-in to the use of high PA.

If you give high PA to a non-LPAE kernel I believe it will fail hard and
fast.  I can check.


UP will mostly boot from the boot view of the memory. keystone_pv_fixup()
will bail out for the higher PA. Let me know if you see otherwise.

Regards,
Santosh



Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config

2016-06-06 Thread Santosh Shilimkar

On 6/5/2016 9:56 PM, Keerthy wrote:



[...]


Santosh,

I posted a consolidated series for k2l.


Thanks. Will pick that up.


Franklin,

Could you re-post the k2g series on top of the series I posted today.


Please do.


Re: [PATCH 0/5] ARM:Keystone: Add pinmuxing support

2016-06-03 Thread Santosh Shilimkar

Franklin,

On 6/3/2016 11:42 AM, Franklin Cooper Jr. wrote:

Gentle ping on this series

On 04/27/2016 09:11 AM, Franklin S Cooper Jr wrote:

Unlike most Keystone 2 devices, K2G supports pinmuxing of its pins. This
patch series enables pinmuxing for Keystone 2 devices.

Franklin S Cooper Jr (1):
  ARM: keystone: defconfig: Enable PINCTRL SINGLE for Keystone 2

Lokesh Vutla (3):
  ARM: Keystone: Enable PINCTRL for Keystone ARCH
  ARM: dts: keystone: Header file for pinctrl constants
  ARM: dts: k2g-evm: Add pinmuxing for UART0

Vitaly Andrianov (1):
  ARM: dts: k2g: Add pinctrl support


Can you please check if it needs to be refreshed
against v4.7-rc1, and if yes, please re-post it.
I will apply it for 4.8.

Regards,
Santosh


Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features

2016-06-02 Thread Santosh Shilimkar

On 6/2/2016 5:34 AM, Nishanth Menon wrote:

On 06/01/2016 06:26 PM, Santosh Shilimkar wrote:
[...]

Side note on LPAE:
For our current device tree and u-boot, LPAE is mandatory to bootup
for current Keystone boards - but this is not a SoC requirement,
booting without LPAE/HIGHMEM results in non-coherent DDR accesses.


This sounds like a regression, I thought we had this working when
keystone was initially merged and we got both the coherent and
non-coherent mode working with the same DT.


Yes and it works. The coherent memory space itself is beyond 4GB so


Hmm... True, I just tested next-20160602 with mem_lpae set to 0 in
u-boot and it seems to boot just fine.


I don't understand a requirement of having coherent memory without
LPAE.


Looks like a messed-up description on my end. Looks like I have to
update my automated test framework to incorporate the manual steps
involved.


No worries. I am glad you got your setup working.

Regards,
Santosh



Re: [PATCH v1 1/2] ARM: dts: keystone: remove bogus IO resource entry from PCI binding

2016-06-02 Thread Santosh Shilimkar

On 6/2/2016 8:17 AM, Murali Karicheri wrote:

The PCI DT bindings contain a bogus entry for IO space which is not
supported on Keystone. The current bogus entry has an invalid size
and throws the following error during boot.

[0.420713] keystone-pcie 21021000.pcie: error -22: failed to map
   resource [io  0x-0x40003fff]

So remove it from the dts. While at it, also add a bus-range
value that eliminates the following log at boot-up.

[0.420659] No bus range found for /soc/pcie@2102, using [bus 00-ff]

Signed-off-by: Murali Karicheri 
---

Both 1/2 and 2/2 look fine to me. Will queue them for the
next merge window.

Regards,
Santosh


Re: [PATCH] rds: fix an infoleak in rds_inc_info_copy

2016-06-02 Thread Santosh Shilimkar

On 6/2/2016 1:11 AM, Kangjie Lu wrote:

The last field "flags" of the object "minfo" is not initialized.
Copying this object out may leak kernel stack data.
Assign 0 to it to avoid the leak.

Signed-off-by: Kangjie Lu 
---
 net/rds/recv.c | 2 ++
 1 file changed, 2 insertions(+)


Acked-by: Santosh Shilimkar 
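
A generic editorial illustration of the bug class being fixed (struct and
field names below are invented; the real fix is in rds_inc_info_copy() in
net/rds/recv.c): any struct member left uninitialized before
copy_to_user() leaks whatever happened to be on the kernel stack.

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct demo_info {
	__u64 seq;
	__u64 len;
	__u32 flags;
};

static int demo_copy_out(void __user *ubuf)
{
	struct demo_info minfo;

	minfo.seq = 1;
	minfo.len = 2;
	minfo.flags = 0;	/* the fix: never copy out stack garbage */

	return copy_to_user(ubuf, &minfo, sizeof(minfo)) ? -EFAULT : 0;
}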


Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features

2016-06-01 Thread Santosh Shilimkar

On 6/1/2016 3:49 PM, Nishanth Menon wrote:

On 06/01/2016 05:31 PM, Arnd Bergmann wrote:


[...]



Santosh, Bill, Lokesh, Grygorii: could you help feedback on the above
comments from Arnd?


Already responded to Arnd's email.


Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features

2016-06-01 Thread Santosh Shilimkar

On 6/1/2016 3:31 PM, Arnd Bergmann wrote:

On Wednesday, June 1, 2016 4:31:54 PM CEST Nishanth Menon wrote:

Introduce ARCH_KEYSTONE_TYPICAL which is common for all Keystone
platforms. This is particularly useful when custom optimized defconfig
builds are created for Keystone architecture platforms.

An example of the same would be a sample fragment ks_only.cfg:
http://pastebin.ubuntu.com/16904991/ - This prunes all arch other than
keystone and any options the other architectures may enable.

git clean -fdx && git reset --hard && \
 ./scripts/kconfig/merge_config.sh -m \
./arch/arm/configs/multi_v7_defconfig ~/tmp/ks_only.cfg &&\
 make olddefconfig

The above unfortunately will disable options necessary for KS2 boards
to boot to the bare minimum initramfs.

Hence the "KEYSTONE_TYPICAL" option is designed similar to commit 8d9166b519fd
("omap2/3/4: Add Kconfig option to compile in typical omap features")
that can be enabled for most Keystone platform boards
without needing to rediscover these options in the defconfig all over again -
examples include a multi_v7_defconfig base and optimizations done on top
of it for the Keystone platform.


I'd rather remove the option for OMAP as well, it doesn't really fit in with
how we do things for other platforms, and selecting a lot of other Kconfig
symbols tends to cause circular dependencies.


Yes.



NOTE: the alternative is to select the configurations under
ARCH_KEYSTONE. However, that would fail multi_v7 builds on ARM
variants that dont work with LPAE.


Please no arbitrary selects from the platform.


Cc: Bill Mills 
Cc: Murali Karicheri 
Cc: Grygorii Strashko 
Cc: Tero Kristo 
Cc: Lokesh Vutla 
Signed-off-by: Nishanth Menon 
---

Based on: next-20160601

Tested for basic initramfs boot for K2HK/K2G platforms with the
http://pastebin.ubuntu.com/16904991/ fragment + multi_v7_defconfig

Side note on LPAE:
For our current device tree and u-boot, LPAE is mandatory to bootup
for current Keystone boards - but this is not a SoC requirement,
booting without LPAE/HIGHMEM results in non-coherent DDR accesses.


This sounds like a regression, I thought we had this working when
keystone was initially merged and we got both the coherent and
non-coherent mode working with the same DT.


Yes, and it works. The coherent memory space itself is beyond 4 GB, so
I don't understand the requirement of having coherent memory without
LPAE.


Currently:
- U-Boot assumes that LPAE is always enabled in the kernel and updates the
DT memory node with the higher addresses. Because of this, you are not
detecting any memory without LPAE and the kernel crashes very early, hence
no prints. So, set the mem_lpae env setting to 0 in U-Boot.


We could work around this in the kernel by detecting the faulty u-boot
behavior and fixing up the addresses in an early platform callback.


U-boot is already doing that and I don't see any issue with it.


- DT also assumes that LPAE is always enabled, and always asks for
dma-address translation from higher addresses to lower addresses.
Just delete the "dma-ranges" property or create a one-to-one mapping
like dma-ranges = <0x8000 0x0 0x8000 0x8000>


This may be a bit trickier, I think originally keystone ignored the
dma-ranges property and hacked up its own offset by adding a magic
constant to the dma address using a bus notifier. We probably don't
want to bring that hack back, but maybe we can come up with another
solution.


I don't think we should go down this path ever. U-Boot should modify
this parameter as it did previously.

Regards,
Santosh


Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config

2016-05-24 Thread Santosh Shilimkar

Hi Keerthy,

On 5/23/2016 8:56 PM, Keerthy wrote:



On Tuesday 24 May 2016 09:07 AM, Lokesh Vutla wrote:



On Monday 23 May 2016 05:59 PM, Keerthy wrote:

keystone-k2l devices use pinmux and are compliant with PINCTRL_SINGLE.
Hence enable the config option.

Signed-off-by: Keerthy 


A similar patch[1] is already posted.

[1]https://patchwork.kernel.org/patch/8958091/


Ah, I had not seen them. If they are already reviewed and closer to being
merged, then patches 2 and 3 of this series can be dropped.


Once 4.7-rc2 is out, please rebase these floating patches
against it and post a consolidated series. I will line them
up for 4.8.

Regards,
Santosh


Re: [PATCH 46/54] MAINTAINERS: Add file patterns for ti device tree bindings

2016-05-22 Thread Santosh Shilimkar

On 5/22/2016 2:06 AM, Geert Uytterhoeven wrote:

Submitters of device tree binding documentation may forget to CC
the subsystem maintainer if this is missing.

Signed-off-by: Geert Uytterhoeven 
Cc: Santosh Shilimkar 
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
Please apply this patch directly if you want to be involved in device
tree binding documentation for your subsystem.
---

Acked-by: Santosh Shilimkar 


Re: [rcu_sched stall] regression/miss-config ?

2016-05-19 Thread Santosh Shilimkar

Hi Paul,

On 5/17/2016 12:15 PM, Paul E. McKenney wrote:

On Tue, May 17, 2016 at 06:46:22AM -0700, santosh.shilim...@oracle.com wrote:

On 5/16/16 5:58 PM, Paul E. McKenney wrote:

On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:

On 5/16/2016 10:34 AM, Paul E. McKenney wrote:

On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:


[...]


Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
you need more housekeeping CPUs than you currently have configured.


Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
housekeeping? It seems like without that, the clock-event code will just use
CPU0 for things like broadcasting, which might become a bottleneck.
This could explain the hrtimer_interrupt() path getting slowed
down because of the housekeeping bottleneck.

$cat .config | grep NO_HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set


Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
housekeeping tasks, including the RCU grace-period kthreads.  So you are
booting without any nohz_full boot parameter?  You can end up with the
same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
that you can with CONFIG_NO_HZ_FULL_ALL=y.


I see. Yes, the systems are booting without the nohz_full boot parameter.
Will try to add more housekeeping CPUs and update the thread
after the verification, since it takes time to reproduce the issue.

Thanks for the discussion so far, Paul. It's very insightful for me.


Please let me know how things go with further testing, especially with
the priority setting.


Sorry for the delay. I managed to get information about the XEN use case
custom config as discussed above. To reduce variables, I disabled
"CONFIG_NO_HZ_FULL" altogether, so the effective setting was:


CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_TREE_RCU_TRACE=y
CONFIG_RCU_KTHREAD_PRIO=1
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y

Unfortunately the XEN test still failed; the log is at the end of
the email. This test is a bit peculiar, though, since
it is a database running in a VM with 1 or 2 CPUs. One
suspicion is that because the database RT processes are
hogging the CPU(s), the kernel RCU thread is not getting a chance
to run, which eventually results in the stall. Does that
make sense?

Please note that it is a non-preempt kernel running RT processes. ;-)

# cat .config | grep PREEMPT
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

Regards,
Santosh
...

rcu_sched kthread starved for 399032 jiffies!
INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, 
t=462037 jiffies, g=11, c=118887, q=0)
All QSes seen, last rcu_sched kthread activity 462037 
(4296277632-4295815595), jiffies_till_next_fqs=3, root ->qsmask 0x0

ocssd.bin   R  running task0 15375  1 0x
  8800ec003bc8 810a8581 81abf980
 0001d068 8800ec003c28 810e9c98 
 0086  0086 0082
Call Trace:
   [] sched_show_task+0xb1/0x120
 [] print_other_cpu_stall+0x288/0x2d0
 [] __rcu_pending+0x180/0x230
 [] rcu_check_callbacks+0x95/0x140
 [] update_process_times+0x42/0x70
 [] tick_sched_handle+0x39/0x80
 [] tick_sched_timer+0x52/0xa0
 [] __run_hrtimer+0x74/0x1d0
 [] ? tick_nohz_handler+0xc0/0xc0
 [] hrtimer_interrupt+0x102/0x240
 [] xen_timer_interrupt+0x2e/0x130
 [] ? add_interrupt_randomness+0x3a/0x1f0
 [] ? store_cursor_blink+0xc0/0xc0
 [] handle_irq_event_percpu+0x54/0x1b0
 [] handle_percpu_irq+0x47/0x70
 [] generic_handle_irq+0x27/0x40
 [] evtchn_2l_handle_events+0x25a/0x260
 [] ? __do_softirq+0x191/0x2f0
 [] __xen_evtchn_do_upcall+0x4f/0x90
 [] xen_evtchn_do_upcall+0x34/0x50
 [] xen_hvm_callback_vector+0x6e/0x80
 
rcu_sched kthread starved for 462037 jiffies!




Re: [rcu_sched stall] regression/miss-config ?

2016-05-16 Thread Santosh Shilimkar

On 5/16/2016 10:34 AM, Paul E. McKenney wrote:

On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:

On 5/16/2016 5:03 AM, Paul E. McKenney wrote:

On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilim...@oracle.com wrote:

On 5/15/16 2:18 PM, Santosh Shilimkar wrote:

Hi Paul,

I was asking Sasha about [1] since other folks in Oracle
also stumbled upon similar RCU stalls with the v4.1 kernel in
different workloads. A similar issue with RDS was reported to me
as well, and looking at [1], [2], [3] and [4], I thought
of reaching out to see if you can help us understand
this issue better.

I have also included the RCU-specific config used in these
test(s). It is very hard to reproduce the issue, but one of
the data points is that it reproduces on systems with a larger
number of CPUs (64+). The same workload with less than 64 CPUs doesn't
show the issue. Someone also told me that making use of
the SLAB allocator instead of SLUB makes a difference, but I
haven't verified that part for RDS.

Let me know your thoughts. Thanks in advance !!


One of my colleagues told me the pastebin server I used
is Oracle-internal only, so I am adding the relevant logs along
with the email.



[...]


[1] https://lkml.org/lkml/2014/12/14/304



[2]  Log 1 snippet:
-
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU { 54}  (t=6 jiffies
g=66023 c=66022 q=0)
Task dump for CPU 54:
ksoftirqd/54R  running task0   389  2 0x0008
 0007 88ff7f403d38 810a8621 0036
 81ab6540 88ff7f403d58 810a86cf 0086
 81ab6940 88ff7f403d88 810e3ad3 81ab6540
Call Trace:
   [] sched_show_task+0xb1/0x120
 [] dump_cpu_task+0x3f/0x50
 [] rcu_dump_cpu_stacks+0x83/0xc0
 [] print_cpu_stall+0xfc/0x170
 [] __rcu_pending+0x2bb/0x2c0
 [] rcu_check_callbacks+0x9d/0x170
 [] update_process_times+0x42/0x70
 [] tick_sched_handle+0x39/0x80
 [] tick_sched_timer+0x44/0x80
 [] __run_hrtimer+0x74/0x1d0
 [] ? tick_nohz_handler+0xa0/0xa0
 [] hrtimer_interrupt+0x102/0x240
 [] local_apic_timer_interrupt+0x39/0x60
 [] smp_apic_timer_interrupt+0x45/0x59
 [] apic_timer_interrupt+0x6e/0x80
   [] ? free_one_page+0x164/0x380
 [] ? __free_pages_ok+0xc3/0xe0
 [] __free_pages+0x25/0x40
 [] rds_message_purge+0x60/0x150 [rds]
 [] rds_message_put+0x44/0x80 [rds]
 [] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
 [] ? _raw_spin_unlock_irqrestore+0x1b/0x50
 [] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
 [] poll_cq+0xa1/0xe0 [rds_rdma]
 [] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]


The most likely possibility is that there is a 60-second-long loop in
one of the above functions.  This is within bottom-half execution, so
unfortunately the usual trick of placing cond_resched_rcu_qs() within this
loop, but outside of any RCU read-side critical section does not work.


First of all, thanks for the explanation.

There is no loop which can last for 60 seconds in the above code, since
it is just a completion queue handler used to free up buffers, much like a
NIC driver's bottom half (NAPI). It is done in tasklet context for latency
reasons, which RDS cares about most. Just to get your attention, the RCU
stall is also seen with XEN code; the log for it is at the end of the email.

Another important observation is that, for RDS, if we avoid higher-order
page allocation(s), the issue is not reproducible so far.
In other words, for PAGE_SIZE (4K, get_order(bytes) == 0) allocations,
the system continues to run without any issue, so the loop scenario
is ruled out more or less.

To be specific, with PAGE_SIZE allocations, alloc_pages()
is just allocating a single page and __free_page() is used
instead of __free_pages(), as in the snippet below.

--
if (bytes >= PAGE_SIZE)
page = alloc_pages(gfp, get_order(bytes));

.

(rm->data.op_sg[i].length <= PAGE_SIZE) ?
__free_page(sg_page(&rm->data.op_sg[i])) :
__free_pages(sg_page(&rm->data.op_sg[i]),
get_order(rm->data.op_sg[i].length));



This sounds like something to take up with the mm folks.


Sure. Will do once the link between the two issues is established.


Therefore, if there really is a loop here, one fix would be to
periodically unwind back out to run_ksoftirqd(), but setting up so that
the work would be continued later.  Another fix might be to move this

>from tasklet context to workqueue context, where cond_resched_rcu_qs()

can be used -- however, this looks a bit like networking code, which
does not always take kindly to being run in process context (though
careful use of local_bh_disable() and local_bh_enable() can sometimes
overcome this issue).  A third fix, which works only if this code does
not use RCU and does not invoke any code that does use RCU, is to tell
RCU that it should ignore this code (which will require a little work
on RCU, as it currently does not tolerate this sort of thing aside from
the idle threads).  In this last approach, event-tra

Re: [rcu_sched stall] regression/miss-config ?

2016-05-16 Thread Santosh Shilimkar

On 5/16/2016 5:03 AM, Paul E. McKenney wrote:

On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilim...@oracle.com wrote:

On 5/15/16 2:18 PM, Santosh Shilimkar wrote:

Hi Paul,

I was asking Sasha about [1] since other folks in Oracle
also stumbled upon similar RCU stalls with the v4.1 kernel in
different workloads. A similar issue with RDS was reported to me
as well, and looking at [1], [2], [3] and [4], I thought
of reaching out to see if you can help us understand
this issue better.

I have also included the RCU-specific config used in these
test(s). It is very hard to reproduce the issue, but one of
the data points is that it reproduces on systems with a larger
number of CPUs (64+). The same workload with less than 64 CPUs doesn't
show the issue. Someone also told me that making use of
the SLAB allocator instead of SLUB makes a difference, but I
haven't verified that part for RDS.

Let me know your thoughts. Thanks in advance !!


One of my colleagues told me the pastebin server I used
is Oracle-internal only, so I am adding the relevant logs along
with the email.



[...]


[1] https://lkml.org/lkml/2014/12/14/304



[2]  Log 1 snippet:
-
 INFO: rcu_sched self-detected stall on CPU
 INFO: rcu_sched self-detected stall on CPU { 54}  (t=6 jiffies
g=66023 c=66022 q=0)
 Task dump for CPU 54:
 ksoftirqd/54R  running task0   389  2 0x0008
  0007 88ff7f403d38 810a8621 0036
  81ab6540 88ff7f403d58 810a86cf 0086
  81ab6940 88ff7f403d88 810e3ad3 81ab6540
 Call Trace:
[] sched_show_task+0xb1/0x120
  [] dump_cpu_task+0x3f/0x50
  [] rcu_dump_cpu_stacks+0x83/0xc0
  [] print_cpu_stall+0xfc/0x170
  [] __rcu_pending+0x2bb/0x2c0
  [] rcu_check_callbacks+0x9d/0x170
  [] update_process_times+0x42/0x70
  [] tick_sched_handle+0x39/0x80
  [] tick_sched_timer+0x44/0x80
  [] __run_hrtimer+0x74/0x1d0
  [] ? tick_nohz_handler+0xa0/0xa0
  [] hrtimer_interrupt+0x102/0x240
  [] local_apic_timer_interrupt+0x39/0x60
  [] smp_apic_timer_interrupt+0x45/0x59
  [] apic_timer_interrupt+0x6e/0x80
[] ? free_one_page+0x164/0x380
  [] ? __free_pages_ok+0xc3/0xe0
  [] __free_pages+0x25/0x40
  [] rds_message_purge+0x60/0x150 [rds]
  [] rds_message_put+0x44/0x80 [rds]
  [] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
  [] ? _raw_spin_unlock_irqrestore+0x1b/0x50
  [] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
  [] poll_cq+0xa1/0xe0 [rds_rdma]
  [] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]


The most likely possibility is that there is a 60-second-long loop in
one of the above functions.  This is within bottom-half execution, so
unfortunately the usual trick of placing cond_resched_rcu_qs() within this
loop, but outside of any RCU read-side critical section does not work.


First of all, thanks for the explanation.

There is no loop which can last for 60 seconds in the above code, since
it is just a completion queue handler used to free up buffers, much like a
NIC driver's bottom half (NAPI). It is done in tasklet context for latency
reasons, which RDS cares about most. Just to get your attention, the RCU
stall is also seen with XEN code; the log for it is at the end of the email.

Another important observation is that, for RDS, if we avoid higher-order
page allocation(s), the issue is not reproducible so far.
In other words, for PAGE_SIZE (4K, get_order(bytes) == 0) allocations,
the system continues to run without any issue, so the loop scenario
is ruled out more or less.

To be specific, with PAGE_SIZE allocations, alloc_pages()
is just allocating a single page and __free_page() is used
instead of __free_pages(), as in the snippet below.

--
if (bytes >= PAGE_SIZE)
page = alloc_pages(gfp, get_order(bytes));

.

(rm->data.op_sg[i].length <= PAGE_SIZE) ?
__free_page(sg_page(&rm->data.op_sg[i])) :
__free_pages(sg_page(&rm->data.op_sg[i]), 
get_order(rm->data.op_sg[i].length));





Therefore, if there really is a loop here, one fix would be to
periodically unwind back out to run_ksoftirqd(), but setting up so that
the work would be continued later.  Another fix might be to move this
from tasklet context to workqueue context, where cond_resched_rcu_qs()
can be used -- however, this looks a bit like networking code, which
does not always take kindly to being run in process context (though
careful use of local_bh_disable() and local_bh_enable() can sometimes
overcome this issue).  A third fix, which works only if this code does
not use RCU and does not invoke any code that does use RCU, is to tell
RCU that it should ignore this code (which will require a little work
on RCU, as it currently does not tolerate this sort of thing aside from
the idle threads).  In this last approach, event-tracing calls must use
the _nonidle suffix.

I am not familiar with the RDS code, so I cannot be more specific.
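
For readers following along, a very rough editorial sketch of the second
option above (all names below are invented; this is not RDS code): do the
draining from a work item, keep the network-sensitive part under
local_bh_disable(), and report an RCU quiescent state between bounded
batches via cond_resched_rcu_qs().

#include <linux/bottom_half.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>

/* Hypothetical stand-in for the RDS completion-queue poll loop. */
static bool demo_poll_one_batch(void)
{
	/* drain a bounded number of completions; return true if more remain */
	return false;
}

static void demo_cq_drain_work(struct work_struct *work)
{
	bool more;

	do {
		local_bh_disable();
		more = demo_poll_one_batch();
		local_bh_enable();

		cond_resched_rcu_qs();	/* let RCU note a quiescent state */
	} while (more);
}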


No worries. Since we saw the issue with XEN too, I was suspecting
that somehow we didn't hav

[rcu_sched stall] regression/miss-config ?

2016-05-15 Thread Santosh Shilimkar

Hi Paul,

I was asking Sasha about [1] since other folks in Oracle
also stumbled upon similar RCU stalls with v4.1 kernel in
different workloads. I was reported similar issue with
RDS as well and looking at [1], [2], [3] and [4], thought
of reaching out to see if you can help us to understand
this issue better.

Have also included RCU specific config used in these
test(s). Its very hard to reproduce the issue but one of
the data point is, it reproduces on systems with larger
CPUs(64+). Same workload with less than 64 CPUs, don't
show the issue. Someone also told me, making use of
SLAB instead SLUB allocator makes difference but I
haven't verified that part for RDS.

Let me know your thoughts. Thanks in advance !!

Regards,
Santosh

[1] https://lkml.org/lkml/2014/12/14/304
[2] log 1: http://pastebin.uk.oracle.com/iUr9qE
[3] log 2: http://pastebin.uk.oracle.com/Oe3cr5
[4] log 3: http://pastebin.uk.oracle.com/bMYLkD
[5] rcu config: http://pastebin.uk.oracle.com/e7NXTW


Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode

2016-04-18 Thread santosh shilimkar

On 4/18/2016 4:36 PM, Tony Lindgren wrote:

* Grygorii Strashko  [160418 08:59]:

On 04/15/2016 09:54 PM, Tony Lindgren wrote:

* santosh shilimkar  [160415 08:22]:

On 4/15/2016 2:26 AM, Grygorii Strashko wrote:


Santosh, Tony, do you want me to perform any additional actions regarding this 
patch?


This patch should be run across the family of SoCs to make
sure wakeup works on all of them, if not done already


Also, I'm not sure if we can just drop this code in question.

After this patch, what function updates the GPIO wkup_en registers
depending on enable_irq_wake()/disable_irq_wake()?



The main purpose of this patch is to *not* modify the GPIO wkup_en registers
depending on enable_irq_wake()/disable_irq_wake() :); instead, all
non-wakeup IRQs should be masked during suspend.


OK that makes sense.


The GPIO wkup_en registers should be always in sync with GPIO irq_en when
GPIO IP is in smart-idle wakeup mode. And this is done now from
omap_gpio_unmask_irq/omap_gpio_mask_irq(). See also [1].

In general, it is more or less similar to GIC + wakeupgen:
- during normal work (including cpuidle) GIC irq_en and Wakeupgen wkup_en
should be in sync always
- during suspend - only IRQs, marked as wake up sources, should be left
unmasked.

Also, I've found an old thread [2] where Santosh proposed using
IRQCHIP_MASK_ON_SUSPEND.
It was not possible at that time, but now IRQCHIP_MASK_ON_SUSPEND can be
used :),
because the OMAP GPIO driver was switched to use the generic IRQ handler instead of
a chained one, so
OMAP GPIO IRQs are now properly handled by the IRQ PM core.
[Chained IRQs (and chained IRQ handlers) are not disabled during suspend/resume
and they are
not maintained by the IRQ PM core; as a result they can trigger way too early on
resume, when
the OMAP GPIO block is not ready/powered.]
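
An editorial sketch of the flag in question (names invented; the real
change is in drivers/gpio/gpio-omap.c): with IRQCHIP_MASK_ON_SUSPEND set,
the IRQ PM core masks every interrupt of this chip that is not marked as a
wakeup source across suspend, so the driver no longer needs its own
irq_set_wake()-driven wkup_en bookkeeping.

#include <linux/irq.h>

static void demo_gpio_mask_irq(struct irq_data *d)
{
	/* mask the line and clear the matching GPIO_IRQWAKEN bit */
}

static void demo_gpio_unmask_irq(struct irq_data *d)
{
	/* unmask the line and set the matching GPIO_IRQWAKEN bit */
}

static struct irq_chip demo_gpio_irq_chip = {
	.name		= "demo-gpio",
	.irq_mask	= demo_gpio_mask_irq,
	.irq_unmask	= demo_gpio_unmask_irq,
	.flags		= IRQCHIP_MASK_ON_SUSPEND,
};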


OK. For my tests this patch does not change anything. I noticed however
that we still have some additional bug somewhere where GPIO wake up
events work fine for omap3 PM runtime, but are flakey for suspend.


I've tested it on: am57x-evm, am437x-idk-evm, omap4-panda


OK thanks! Based on my tests and the above:

Acked-by: Tony Lindgren 


If all works then consider my ack as well :-)


Re: [net][PATCH v2 0/2] RDS: couple of fixes for 4.6

2016-04-16 Thread santosh shilimkar

On 4/16/2016 3:53 PM, David Miller wrote:

From: Santosh Shilimkar 
Date: Thu, 14 Apr 2016 10:43:25 -0700


   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net/rds-fixes


I have no idea how you set this up, but there is no WAY this can be
pulled from by me.


I thought I did base it against 'net' after your last comment.
Just checked again, and the 'net' remote added by me points
to the wrong URL (net-next).


When I try to pull it into 'net' I get 2690 objects.  That means you
didn't base it upon the 'net' tree which you must do.  You can't base
it upon Linus's tree, because if you do I'll get a ton of changes that
are absolutely not appropriate to be pulled into my 'net' tree.

Are you always doing this?  Working against Linus's tree instead of
mine?


No, it's not Linus's tree. It's yours, but not the right one.
Sorry for the trouble. Won't happen again.

Thanks for picking up the patches from patchwork.

Regards,
Santosh


Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode

2016-04-15 Thread santosh shilimkar

On 4/15/2016 2:26 AM, Grygorii Strashko wrote:

On 04/15/2016 11:32 AM, Linus Walleij wrote:

On Tue, Apr 12, 2016 at 12:52 PM, Grygorii Strashko
 wrote:


Now GPIO IRQ loss is observed on dra7-evm after suspend/resume cycle

(...)

Cc: Roger Quadros 
Signed-off-by: Grygorii Strashko 


Can I get some explicit ACK / Tested-by tags for this patch?


Roger has promised to test it once the suspend regression is fixed
for dra7-evm, probably next rc.



Is it a serious regression that will need to go in as a fix and
tagged for stable?



This issue has been here since 2012, so I think it's not very critical -
it seems the bit combination which causes the issue is rare.

Regarding stable:
4.4 - good to have, simple merge conflict
4.1 - some merge resolution is required
older kernel - it will be hard to backport it due to significant
changes in omap gpio driver

Santosh, Tony, do you want me to perform any additional actions regarding this 
patch?


This patch should be run across the family of SoCs to make
sure wakeup works on all of them, if not done already


[net][PATCH v2 2/2] RDS: Fix the atomicity for congestion map update

2016-04-14 Thread Santosh Shilimkar
Two different threads with different RDS sockets may be in
rds_recv_rcvbuf_delta() via the receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Let's use atomics to avoid such an issue.
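
For illustration only (not part of the patch; DEMO_MAP_PAGE_BITS stands in
for RDS_CONG_MAP_PAGE_BITS, and the real code is in net/rds/cong.c): two
sockets whose ports land in the same word of the congestion bitmap can run
this concurrently from the receive path, so the read-modify-write of that
word must be atomic.

#include <linux/bitops.h>
#include <linux/types.h>
#include <asm/page.h>

#define DEMO_MAP_PAGE_BITS	(PAGE_SIZE * 8)

static void demo_cong_set_bit(void *page_addr, u16 port)
{
	unsigned long off = port % DEMO_MAP_PAGE_BITS;

	/* __set_bit_le(off, page_addr) could lose a concurrent update */
	set_bit_le(off, page_addr);
}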

Full credit to Wengang  for
finding the issue, analysing it, and also pointing out
the offending code along with a spinlock-based fix.

Reviewed-by: Leon Romanovsky 
Signed-off-by: Wengang Wang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/cong.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-   __set_bit_le(off, (void *)map->m_page_addrs[i]);
+   set_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 
port)
i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-   __clear_bit_le(off, (void *)map->m_page_addrs[i]);
+   clear_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.9.1



[net][PATCH v2 0/2] RDS: couple of fixes for 4.6

2016-04-14 Thread Santosh Shilimkar
v2:
Rebased the fixes against 'net' instead of 'net-next'. The patches are also
available at the git tree below.

The following changes since commit e013b7780c41b471c4269ac9ccafb65ba7c9ec86:

  Merge branch 'dsa-voidify-ops' (2016-04-08 16:51:15 -0400)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net/rds-fixes

for you to fetch changes up to e9155afb1902380938ca83ba8504aaa2d7ee5210:

  RDS: Fix the atomicity for congestion map update (2016-04-08 15:08:13 -0700)


Qing Huang (1):
  RDS: fix endianness for dp_ack_seq

Santosh Shilimkar (1):
  RDS: Fix the atomicity for congestion map update

 net/rds/cong.c  | 4 ++--
 net/rds/ib_cm.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Regards,
Santosh


[net][PATCH v2 1/2] RDS: fix endianness for dp_ack_seq

2016-04-14 Thread Santosh Shilimkar
From: Qing Huang 

dp->dp_ack_seq is used in big-endian format. We need to do the
endianness conversion when we assign a host-format value
to it.

Signed-off-by: Qing Huang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8764970..310cabc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection 
*conn,
dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
dp->dp_protocol_minor_mask = 
cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
-   dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+   dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic));
 
/* Advertise flow control */
if (ic->i_flowctl) {
-- 
1.9.1



Re: [PATCH] ARM: dts: keystone: Add aliases for SPI nodes

2016-04-13 Thread santosh shilimkar

On 4/13/2016 3:52 AM, Vignesh R wrote:

Add aliases for SPI nodes; this is required to probe the SPI devices in
U-Boot.

Signed-off-by: Vignesh R 
---

Applied. Thanks !!


Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode

2016-04-12 Thread santosh shilimkar

On 4/12/2016 3:52 AM, Grygorii Strashko wrote:

Now GPIO IRQ loss is observed on dra7-evm after suspend/resume cycle
in the following case:
   extcon_usb1(id_irq) ->  pcf8575.gpio1 -> omapgpio6.gpio11 -> gic

the extcon_usb1 is wake up source and it enables IRQ wake up for
id_irq by calling enable/disable_irq_wake() during suspend/resume
which, in turn, causes execution of omap_gpio_wake_enable(). And
omap_gpio_wake_enable() will set/clear corresponding bit in
GPIO_IRQWAKEN_x register.

omapgpio6 configuration after boot - wakeup is enabled for GPIO IRQs
by default from  omap_gpio_irq_type:
GPIO_IRQSTATUS_SET_0| 0x0400
GPIO_IRQSTATUS_CLR_0| 0x0400
GPIO_IRQWAKEN_0 | 0x0400
GPIO_RISINGDETECT   | 0x
GPIO_FALLINGDETECT  | 0x0400

omapgpio6 configuration after after suspend/resume cycle:
GPIO_IRQSTATUS_SET_0| 0x0400
GPIO_IRQSTATUS_CLR_0| 0x0400
GPIO_IRQWAKEN_0 | 0x <---
GPIO_RISINGDETECT   | 0x
GPIO_FALLINGDETECT  | 0x0400

As a result, the system will start to lose interrupts from the pcf8575
GPIO expander, because when the OMAP GPIO IP is in smart-idle wakeup
mode there is no guarantee that transition(s) on a non-wakeup input GPIO
pin will trigger an asynchronous wake-up request to the PRCM and then
IRQ generation. The IRQ will be generated while the GPIO is in active
mode - for example, for some time after accessing the GPIO bank
registers IRQs will be generated normally, but the issue will happen
again once the PRCM puts the GPIO back into low-power smart-idle wakeup
mode.

Note 1. The issue is not reproduced if the debounce clock is enabled for
the GPIO bank.

Note 2. The issue is hardly reproducible if the GPIO pin group contains
both wakeup and non-wakeup GPIOs - for example, it will be hard to
reproduce the issue with pin 2 if GPIO_IRQWAKEN_0=0x1
GPIO_IRQSTATUS_SET_0=0x3 GPIO_FALLINGDETECT=0x3 (TRM "Power Saving by
Grouping the Edge/Level Detection").

Note 3. There is nothing in common between system wake-up and the OMAP
GPIO bank IP wake-up logic - the latter defines how the GPIO bank
ON-IDLE-ON transition happens inside the SoC under the control of the
PRCM.

Hence, fix the problem by removing the omap_set_gpio_wakeup() function
completely, thereby always keeping the GPIO IRQ mask/unmask
(IRQSTATUS_SET) and wake-up enable (GPIO_IRQWAKEN) bits in sync, and by
adding the IRQCHIP_MASK_ON_SUSPEND flag to the OMAP GPIO irqchip. That
way non-wakeup GPIO IRQs will be properly masked/unmasked by the IRQ PM
core during the suspend/resume cycle.

Cc: Roger Quadros 
Signed-off-by: Grygorii Strashko 
---

The GPIO IP has two levels of wakeup controls, and you are
just removing the SYSCFG wakeup and relying on the IRQ
line wakeup. I like the usage of "IRQCHIP_MASK_ON_SUSPEND", but
please be careful: this change might break older OMAPs.
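
For context, a minimal sketch of what opting into IRQCHIP_MASK_ON_SUSPEND
looks like in an irqchip driver (the demo_* names are made up; this is not
the gpio-omap change itself):

#include <linux/irq.h>

/*
 * With this flag set, the IRQ PM core masks every interrupt that has not
 * been marked as a wakeup source (via irq_set_irq_wake()) before entering
 * suspend, and unmasks it again on resume.
 */
static void demo_irq_mask(struct irq_data *d)   { /* write IRQSTATUS_CLR */ }
static void demo_irq_unmask(struct irq_data *d) { /* write IRQSTATUS_SET */ }

static int demo_irq_set_wake(struct irq_data *d, unsigned int enable)
{
	/* forward the wake enable to the bank/parent interrupt */
	return 0;
}

static struct irq_chip demo_gpio_irq_chip = {
	.name		= "demo-gpio",
	.irq_mask	= demo_irq_mask,
	.irq_unmask	= demo_irq_unmask,
	.irq_set_wake	= demo_irq_set_wake,
	.flags		= IRQCHIP_MASK_ON_SUSPEND,
};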


  drivers/gpio/gpio-omap.c | 42 ++
  1 file changed, 2 insertions(+), 40 deletions(-)

diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c
index 551dfa9..b98ede7 100644
--- a/drivers/gpio/gpio-omap.c
+++ b/drivers/gpio/gpio-omap.c
@@ -611,51 +611,12 @@ static inline void omap_set_gpio_irqenable(struct 
gpio_bank *bank,
omap_disable_gpio_irqbank(bank, BIT(offset));
  }

-/*
- * Note that ENAWAKEUP needs to be enabled in GPIO_SYSCONFIG register.
- * 1510 does not seem to have a wake-up register. If JTAG is connected
- * to the target, system will wake up always on GPIO events. While
- * system is running all registered GPIO interrupts need to have wake-up
- * enabled. When system is suspended, only selected GPIO interrupts need
- * to have wake-up enabled.
- */
-static int omap_set_gpio_wakeup(struct gpio_bank *bank, unsigned offset,
-   int enable)
-{
-   u32 gpio_bit = BIT(offset);
-   unsigned long flags;
-
-   if (bank->non_wakeup_gpios & gpio_bit) {
-   dev_err(bank->chip.parent,
-   "Unable to modify wakeup on non-wakeup GPIO%d\n",
-   offset);
-   return -EINVAL;
-   }
-
-   raw_spin_lock_irqsave(&bank->lock, flags);
-   if (enable)
-   bank->context.wake_en |= gpio_bit;
-   else
-   bank->context.wake_en &= ~gpio_bit;
-
-   writel_relaxed(bank->context.wake_en, bank->base + bank->regs->wkup_en);
-   raw_spin_unlock_irqrestore(&bank->lock, flags);
-
-   return 0;
-}
-
  /* Use disable_irq_wake() and enable_irq_wake() functions from drivers */
  static int omap_gpio_wake_enable(struct irq_data *d, unsigned int enable)
  {
struct gpio_bank *bank = omap_irq_data_get_bank(d);
-   unsigned offset = d->hwirq;
-   int ret;

-   ret = omap_set_gpio_wakeup(bank, offset, enable);
-   if (!ret)
-   ret = irq_set_irq_wake(bank->irq, enable);
-
-   return ret;
+   return irq_set_irq_wake(bank->irq, enable);
  }

  static int omap_gpio_request(struct gpio_chip *chip, unsigned offset)
@@ -1187,6 +1148,7 @@ static int omap_gpio_probe(struct platform_device

[net-next][PATCH 0/2] RDS: couple of fixes for 4.6

2016-04-08 Thread Santosh Shilimkar
Patches are also available at below git tree. 

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds-fixes

Qing Huang (1):
  RDS: fix endianness for dp_ack_seq

Santosh Shilimkar (1):
  RDS: Fix the atomicity for congestion map update

 net/rds/cong.c  | 4 ++--
 net/rds/ib_cm.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
1.9.1



[net-next][PATCH 2/2] RDS: Fix the atomicity for congestion map update

2016-04-08 Thread Santosh Shilimkar
Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via the receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
become incorrect. Let's use atomics to avoid such an issue.

Full credit to Wengang  for
finding the issue, analysing it and also pointing to the
offending code with a spin-lock-based fix.

Reviewed-by: Leon Romanovsky 
Signed-off-by: Wengang Wang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/cong.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-   __set_bit_le(off, (void *)map->m_page_addrs[i]);
+   set_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 
port)
i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
-   __clear_bit_le(off, (void *)map->m_page_addrs[i]);
+   clear_bit_le(off, (void *)map->m_page_addrs[i]);
 }
 
 static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.9.1



[net-next][PATCH 1/2] RDS: fix endianness for dp_ack_seq

2016-04-08 Thread Santosh Shilimkar
From: Qing Huang 

dp->dp_ack_seq is used in big-endian format. We need to do the
endianness conversion when we assign a host-order value to it.

Signed-off-by: Qing Huang 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8764970..310cabc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection 
*conn,
dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version);
dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version);
dp->dp_protocol_minor_mask = 
cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
-   dp->dp_ack_seq = rds_ib_piggyb_ack(ic);
+   dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic));
 
/* Advertise flow control */
if (ic->i_flowctl) {
-- 
1.9.1



Re: [PATCH v3 2/2] usb:dwc3: pass arch data to xhci-hcd child

2016-04-04 Thread santosh shilimkar

On 4/3/2016 11:28 PM, Felipe Balbi wrote:

santosh shilimkar  writes:

+Arnd, RMK,

On 4/1/2016 4:57 AM, Felipe Balbi wrote:


Hi,

Grygorii Strashko  writes:

On 04/01/2016 01:20 PM, Felipe Balbi wrote:


[...]


commit 7ace8fc8219e4cbbfd5b4790390d9a01a2541cdf
Author: Yoshihiro Shimoda 
Date:   Mon Jul 13 18:10:05 2015 +0900

  usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU

  The dma_map_single and dma_unmap_single should set "gadget->dev.parent"
  instead of "&gadget->dev" in the first argument because the parent has
  a udc controller's device pointer.
  Otherwise, iommu functions are not called in ARM environment.

  Signed-off-by: Yoshihiro Shimoda 
  Signed-off-by: Felipe Balbi 

Above actually means that DMA configuration code can be dropped from
usb_add_gadget_udc_release() completely. Right?:


true, but now I'm not sure what's better: copy all the necessary bits
from the parent or just pass the parent device to all DMA API calls.

Anybody willing to shed some light here?


The expectation is that drivers should pass the proper dev pointers and let
the core DMA code deal with it, since it knows the per-device DMA properties.


okay, so how do you get proper DMA pointers with something like this:

kdwc3_dma_mask = dma_get_mask(dev);
dev->dma_mask = &kdwc3_dma_mask;

This doesn't do anything.


Drivers actually need to touch the dma_mask(s) only if the core DMA
code hasn't populated them. I see Grygorii has pointed out a couple
of things already.
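
To make that concrete, a hypothetical sketch of what "pass the proper dev
pointer" means for a child device such as the gadget (this mirrors the
udc-core commit quoted above; demo_map_request() is made up):

#include <linux/dma-mapping.h>
#include <linux/usb/gadget.h>

static dma_addr_t demo_map_request(struct usb_gadget *gadget,
				   void *buf, size_t len)
{
	/*
	 * gadget->dev.parent is the controller device that actually carries
	 * the dma_mask, DMA ops and any IOMMU attachment; the logical child
	 * device created on top of it does not.
	 */
	return dma_map_single(gadget->dev.parent, buf, len, DMA_TO_DEVICE);
}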

Regards,
Santosh





Re: [PATCH] ARM: OMAP: wakeupgen: Add comment for unhandled FROZEN transitions

2016-04-04 Thread santosh shilimkar

On 4/4/2016 5:55 AM, Anna-Maria Gleixner wrote:

FROZEN hotplug notifiers are not handled and do not have to be. Insert
a comment to remember that the lack of the FROZEN transitions is no
accident.

Cc: Tony Lindgren 
Cc: Santosh Shilimkar 
Cc: linux-o...@vger.kernel.org
Signed-off-by: Anna-Maria Gleixner 
---

Acked-by: Santosh Shilimkar 


Re: [PATCH v3 2/2] usb:dwc3: pass arch data to xhci-hcd child

2016-04-01 Thread santosh shilimkar

+Arnd, RMK,

On 4/1/2016 4:57 AM, Felipe Balbi wrote:


Hi,

Grygorii Strashko  writes:

On 04/01/2016 01:20 PM, Felipe Balbi wrote:


[...]


commit 7ace8fc8219e4cbbfd5b4790390d9a01a2541cdf
Author: Yoshihiro Shimoda 
Date:   Mon Jul 13 18:10:05 2015 +0900

 usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU

 The dma_map_single and dma_unmap_single should set "gadget->dev.parent"
 instead of "&gadget->dev" in the first argument because the parent has
 a udc controller's device pointer.
 Otherwise, iommu functions are not called in ARM environment.

 Signed-off-by: Yoshihiro Shimoda 
 Signed-off-by: Felipe Balbi 

Above actually means that DMA configuration code can be dropped from
usb_add_gadget_udc_release() completely. Right?:


true, but now I'm not sure what's better: copy all the necessary bits
from the parent or just pass the parent device to all DMA API calls.

Anybody willing to shed some light here?

The expectation is that drivers should pass the proper dev pointers and let
the core DMA code deal with it, since it knows the per-device DMA properties.

RMK did a massive series of patches to fix many drivers which were not
adhering to the DMA APIs.

Regards,
Santosh



Re: [PATCH] ARM: dts: k2*: Rename the k2* files to keystone-k2* files

2016-03-19 Thread santosh shilimkar

On 3/16/2016 7:39 AM, Nishanth Menon wrote:

As reported in [1], rename the k2* dts files to keystone-* files;
this will force consistency throughout.

Script for the same (and hand modified for Makefile and MAINTAINERS
files):
for i in arch/arm/boot/dts/k2*
do
b=`basename $i`;
git mv $i arch/arm/boot/dts/keystone-$b;
sed -i -e "s/$b/keystone-$b/g" arch/arm/boot/dts/*[si]
done

NOTE: bootloaders that depend on older dtb names will need to be
updated as well.

[1] http://marc.info/?l=linux-arm-kernel&m=145637407804754&w=2

Reported-by: Olof Johansson 
Signed-off-by: Nishanth Menon 
---


Thanks, Nishanth, for taking care of this. I will add this to the
next branch soon.

Regards,
Santosh



Re: [PATCH] gpio: omap: drop dev field from gpio_bank structure

2016-03-04 Thread santosh shilimkar

On 3/4/2016 7:25 AM, Grygorii Strashko wrote:

GPIO chip structure already has "parent" field which is used for the
same purpose as "dev" field in gpio_bank structure - store pointer on
GPIO device.

Hence, drop duplicated "dev" field from gpio_bank structure.

Signed-off-by: Grygorii Strashko 
---

Looks good.

Acked-by: Santosh Shilimkar 


Re: RDS: Major clean-up with couple of new features for 4.6

2016-03-02 Thread santosh shilimkar


On 3/2/2016 11:13 AM, David Miller wrote:

From: Santosh Shilimkar 
Date: Tue,  1 Mar 2016 15:20:41 -0800


v3:
Re-generated the same series, omitting the "-D" option from the git
format-patch command. Since the first patch has file removals, git
apply/am can't deal with it when it is formatted with the '-D' option.


Yeah this works much better, series applied, thanks.


Thanks Dave !!

Regards,
Santosh


[net-next][PATCH v3 05/13] RDS: IB: Re-organise ibmr code

2016-03-01 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding fastreg
memory registration support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exi

[net-next][PATCH v3 07/13] RDS: IB: move FMR code to its own file

2016-03-01 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 108 ++
 3 files changed, 134 insertions(+), 106 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   

[net-next][PATCH v3 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-03-01 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use
of work request and completion queues for their operation. This patch
allocates extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1



[net-next][PATCH v3 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-03-01 Thread Santosh Shilimkar
This helps to combine the asynchronous fastreg MR completion handler
with the send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_r

[net-next][PATCH v3 01/13] RDS: Drop stale iWARP RDMA transport

2016-03-01 Thread Santosh Shilimkar
The RDS iWARP support code has become stale and non-testable. As
indicated earlier, I am dropping the support for it.

If new iWARP user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
--- a/net/rds/iw.c
+++ /dev/null
@@ -1,312 +0,0 @@
-/*
- * Copyright (c) 2006 Oracle.  All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- * Redistribution and use in source and binary forms, with or
- * without modification, are permitted provided that the following
- * conditions are met:
- *
- *  - Redistributions of source code must retain the above
- *copyright notice, this list of conditions and the following
- *disclaimer.
- *
- *  - Redistributions in binary form must reproduce the above
- *copyright notice, this list of conditions and the following
- *disclaimer in the documentation and/or other materials
- *provided with the distribution.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRI

[net-next][PATCH v3 06/13] RDS: IB: create struct rds_ib_fmr

2016-03-01 Thread Santosh Shilimkar
Keep the FMR-related fields in their own struct. The fastreg MR
structure will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+

[net-next][PATCH v3 10/13] RDS: IB: add mr reused stats

2016-03-01 Thread Santosh Shilimkar
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0e84843..ec7ea32 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1



[net-next][PATCH v3 03/13] MAINTAINERS: update RDS entry

2016-03-01 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 27393cf..08b084a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9067,10 +9067,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



RDS: Major clean-up with couple of new features for 4.6

2016-03-01 Thread Santosh Shilimkar
v3:
Re-generated the same series, omitting the "-D" option from the git
format-patch command. Since the first patch has file removals, git
apply/am can't deal with it when it is formatted with the '-D' option.

v2:
Dropped module parameter from [PATCH 11/13] as suggested by David Miller

Series is generated against net-next but also applies against Linus's tip
cleanly. Entire patchset is available at below git tree:

 git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds_v2

The diffstat looks a bit scary since almost ~4K lines of code are
getting removed. Brief summary of the series:

- Drop the stale iWARP support:
The RDS iWARP support code has become stale and non-testable for
some time. As discussed and agreed earlier on the list, I am dropping
its support for good. If new iWARP user(s) show up in the future,
the plan is to adapt the existing IB RDMA transport for the special sink case.
- RDS gets SO_TIMESTAMP support
- The long-due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards the new fastreg memory registration (FRMR)
- Lastly, the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as with FMR, and I do
have some patches in progress to address that. But they are not ready for
4.6, so I left them out of this series.

I am also keeping an eye on the new CQ API adaptations, like other ULPs are
doing, and will try to adapt RDS to the same, most likely in the 4.7+ timeframe.

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  47 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 495 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1065 insertions(+), 5035 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

-- 
1.9.1



[net-next][PATCH v3 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-03-01 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method with which one can register
memory with the HCA. Some of the newer HCAs support only the fastreg
MR mode, so we need to add support for it to keep RDS functional
on them.
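
As background, a heavily compressed sketch of the generic FRWR flow in the
verbs API of that era (treat the exact signatures as assumptions -
ib_map_mr_sg() in particular later grew an sg_offset argument - and note
that the real patch plugs this into the RDS MR pools and completion
handling rather than doing it inline; demo_frwr_register() is made up):

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

static int demo_frwr_register(struct ib_pd *pd, struct ib_qp *qp,
			      struct scatterlist *sg, int sg_nents)
{
	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *bad_wr;
	struct ib_mr *mr;
	int n, ret;

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);	/* build the page list */
	if (n < sg_nents) {
		ret = n < 0 ? n : -EINVAL;
		goto out_dereg;
	}

	reg_wr.wr.opcode = IB_WR_REG_MR;		/* registration is itself a WR */
	reg_wr.wr.send_flags = IB_SEND_SIGNALED;	/* and consumes a CQE */
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;

	ret = ib_post_send(qp, &reg_wr.wr, &bad_wr);
	if (ret)
		goto out_dereg;

	/*
	 * mr->rkey is now usable; real code keeps the MR in a pool and later
	 * invalidates (IB_WR_LOCAL_INV) and reuses it instead of freeing it.
	 */
	return 0;

out_dereg:
	ib_dereg_mr(mr);
	return ret;
}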

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..93ff038
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
+
+   ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
+   rdsibdev_to_node(rds_ibdev));
+   if (!ibmr) {
+   err = -ENOMEM;
+   goto out_no_cigar;
+   }
+
+   frmr = &ibmr->u.frmr;
+   frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
+pool->fmr_att

[net-next][PATCH v3 08/13] RDS: IB: add connection info to ibmr

2016-03-01 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info,
we can retrieve the cm_id, which contains the QP handle needed for
work request posting.

We also need to drop the RDS connection on QP error states,
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1



[net-next][PATCH v3 09/13] RDS: IB: handle the RDMA CM time wait event

2016-03-01 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH v3 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-03-01 Thread Santosh Shilimkar
The SO_TIMESTAMP option generates a time stamp for each incoming RDS
message. A user application can enable it via the SO_TIMESTAMP
setsockopt() at SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP
contains the time stamp in struct timeval format.
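
For reference, a user-space sketch of how an application consumes this
(generic sockets API; only the SO_TIMESTAMP/SCM_TIMESTAMP handling is the
point, the socket setup is assumed to have happened elsewhere):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

static void demo_recv_with_tstamp(int fd)
{
	char buf[1024], cbuf[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cm;
	int on = 1;

	/* ask the kernel to attach a timestamp to each received message */
	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

	if (recvmsg(fd, &msg, 0) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_TIMESTAMP) {
			struct timeval tv;

			memcpy(&tv, CMSG_DATA(cm), sizeof(tv));
			printf("rx at %ld.%06ld\n",
			       (long)tv.tv_sec, (long)tv.tv_usec);
		}
	}
}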

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1



[net-next][PATCH v3 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-03-01 Thread Santosh Shilimkar
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both FMR and FRWR. In case both MR types are
supported, FMR is used by default.

The default MR type is still kept as FMR, against what everyone else
is following. The default will be changed to FRMR once the RDS
performance with FRMR is comparable with FMR. The work for that is
in progress.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 10 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..b5342fd 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr);
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



Re: [net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-03-01 Thread santosh shilimkar

On 3/1/2016 2:33 PM, David Miller wrote:


When I try to apply this series, it (strangely) fails on the first patch with:


Strange indeed, since the patches and the tree are against net-next.


Applying: RDS: Drop stale iWARP RDMA transport
error: removal patch leaves file contents

This patch has file removals, and it looks like git am/apply won't
work when the patch is formatted with "-D". It's good for review,
but I didn't realize it would create a problem for applying.
Sorry about that; I wasn't aware of this issue.

git merge or pull does seem to work, though, when tried from the
branch directly.



Please sort this out and respin, thanks.


OK. Will send the same series again, just with the first
patch generated without the -D option. Thanks!

Regards,
Santosh


[net-next][PATCH v2 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-02-27 Thread Santosh Shilimkar
SO_TIMESTAMP generates a time stamp for each incoming RDS message.
A user app can enable it by using the SO_TIMESTAMP setsockopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1
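
For illustration (not part of the patch above): a minimal userspace
sketch showing how an application could enable SO_TIMESTAMP on an RDS
socket and read the SCM_TIMESTAMP control message that the patch emits.
The PF_RDS fallback define and the omitted bind()/connect() calls are
assumptions for this sketch; the cmsg handling is the standard
SOL_SOCKET/SCM_TIMESTAMP pattern.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

#ifndef PF_RDS
#define PF_RDS 21			/* AF_RDS/PF_RDS; fallback if the header lacks it */
#endif
#ifndef SCM_TIMESTAMP
#define SCM_TIMESTAMP SO_TIMESTAMP	/* same value at SOL_SOCKET level */
#endif

static void print_rx_timestamp(struct msghdr *msg)
{
	struct cmsghdr *cmsg;

	/* Walk the control messages and print the receive time stamp */
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_TIMESTAMP) {
			struct timeval tv;

			memcpy(&tv, CMSG_DATA(cmsg), sizeof(tv));
			printf("rx timestamp: %ld.%06ld\n",
			       (long)tv.tv_sec, (long)tv.tv_usec);
		}
	}
}

int main(void)
{
	int one = 1;
	char payload[1024];
	char cbuf[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &one, sizeof(one)) < 0) {
		perror("socket/setsockopt");
		return 1;
	}
	/* bind() to a local RDS endpoint omitted for brevity */
	if (recvmsg(fd, &msg, 0) >= 0)
		print_rx_timestamp(&msg);
	return 0;
}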



[net-next][PATCH v2 05/13] RDS: IB: Re-organise ibmr code

2016-02-27 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding
fastreg memory registration (FRMR) support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exi

[net-next][PATCH v2 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-02-27 Thread Santosh Shilimkar
This helps combine the asynchronous fastreg MR completion handler
with the send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_r

[net-next][PATCH v2 06/13] RDS: IB: create struct rds_ib_fmr

2016-02-27 Thread Santosh Shilimkar
Keep the FMR-related fields in their own struct. The fastreg MR
structure will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+

[net-next][PATCH v2 03/13] MAINTAINERS: update RDS entry

2016-02-27 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 27393cf..08b084a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9067,10 +9067,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-27 Thread Santosh Shilimkar
The RDS iWARP support code has become stale and non-testable. As
indicated earlier, I am dropping the support for it.

If new iWARP user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &

[net-next][PATCH v2 10/13] RDS: IB: add mr reused stats

2016-02-27 Thread Santosh Shilimkar
Add MR reuse statistics to the RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0e84843..ec7ea32 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1



[net-next][PATCH v2 07/13] RDS: IB: move FMR code to its own file

2016-02-27 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 108 ++
 3 files changed, 134 insertions(+), 106 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   

[net-next][PATCH v2 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-27 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH v2 08/13] RDS: IB: add connection info to ibmr

2016-02-27 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info, we can
retrieve the cm_id, which contains the QP handle needed for
work request posting.

We also need to drop the RDS connection on QP error states, which is
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1



[net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-27 Thread Santosh Shilimkar
v2:
Dropped module parameter from [PATCH 11/13] as suggested by David Miller

The series is generated against net-next but also applies cleanly against
Linus's tip. The entire patchset is available in the git tree below:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds_v2

The diff-stat looks a bit scary since almost ~4K lines of code are
being removed. Brief summary of the series:

- Drop the stale iWARP support:
The RDS iWARP support code has become stale and non-testable for
some time. As discussed and agreed earlier on the list, I am dropping
its support for good. If new iWARP user(s) show up in the future,
the plan is to adapt the existing IB RDMA transport for the special sink case.
- RDS gets SO_TIMESTAMP support
- The long-due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards the new FastReg Memory Registration (FRMR) mode
- Lastly, the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as with FMR, and I do
have some patches in progress to address that. But they are not ready for
4.6, so I left them out of this series.

I am also keeping an eye on the new CQ API adaptations that other ULPs are
doing and will try to adapt RDS the same way, most likely in the 4.7+
timeframe.

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  47 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 495 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1065 insertions(+), 5035 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

-- 
1.9.1



[net-next][PATCH v2 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-02-27 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method with which one can
register memory with an HCA. Some of the newer HCAs support only the
fastreg MR mode, so we need to add support for it to keep RDS
functional on them.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..93ff038
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
+
+   ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
+   rdsibdev_to_node(rds_ibdev));
+   if (!ibmr) {
+   err = -ENOMEM;
+   goto out_no_cigar;
+   }
+
+   frmr = &ibmr->u.frmr;
+   frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
+pool->fmr_att

[net-next][PATCH v2 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-02-27 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use
of work request and completion queues for their operation. This patch
allocates extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1
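
For illustration (not part of the patch above): a tiny worked example of
the send-queue sizing done in rds_ib_setup_qp(), using the defaults
visible in the diff (RDS_IB_DEFAULT_SEND_WR = 256 from ib.h,
RDS_IB_DEFAULT_FR_WR = 512). The standalone harness is an assumption;
only the arithmetic mirrors the patch.

#include <assert.h>

#define RDS_IB_DEFAULT_SEND_WR	256	/* send ring slots, from ib.h */
#define RDS_IB_DEFAULT_FR_WR	512	/* extra space for FRMR reg/invalidate WRs */

int main(void)
{
	int w_nr = RDS_IB_DEFAULT_SEND_WR;
	int fr_queue_space = RDS_IB_DEFAULT_FR_WR;	/* 0 when use_fastreg is false */

	/* send CQ entries and max_send_wr: ring slots + FRMR work requests + 1 ack */
	int cqe = w_nr + fr_queue_space + 1;
	int max_send_wr = w_nr + fr_queue_space + 1;

	assert(cqe == 769);
	assert(max_send_wr == 769);
	return 0;
}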



[net-next][PATCH v2 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-27 Thread Santosh Shilimkar
Discover Fast Memory Registration (FRMR) support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both FMR and FRMR. In case both MR types are
supported, FMR is used by default.

The default MR mode is still kept as FMR, against what everyone else is
following. The default will be changed to FRMR once RDS performance
with FRMR is comparable with FMR. Work on that is in progress.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
v2: Dropped the module parameter as suggested by David Miller

 net/rds/ib.c| 10 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..b5342fd 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr);
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 03/13] MAINTAINERS: update RDS entry

2016-02-26 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 355e1c8..9d79bea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9081,10 +9081,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-26 Thread Santosh Shilimkar
The RDS iWARP support code has become stale and non-testable. As
indicated earlier, I am dropping the support for it.

If new iWARP user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &

[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-02-26 Thread Santosh Shilimkar
SO_TIMESTAMP generates a time stamp for each incoming RDS message.
A user app can enable it by using the SO_TIMESTAMP setsockopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1



[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code

2016-02-26 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding
fastreg memory registration (FRMR) support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exi

[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-26 Thread Santosh Shilimkar
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both FMR and FRMR. In case both MR types are
supported, FMR is used by default. Using the module parameter
'prefer_frmr', the user can choose the preferred MR method for RDS. Of
course the module parameter has no effect if the HCA supports only FMR
or only FRMR.

The default MR method is still FMR, in line with what everyone else is
using. The default will be changed to FRMR once RDS performance with
FRMR is comparable with FMR; work on that is in progress.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 14 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..68c94b0 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -47,6 +47,7 @@
 unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
 unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
+bool prefer_frmr;
 
 module_param(rds_ib_mr_1m_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR 
supported");
 
 /*
  * we have a clumsy combination of RCU and a rwsem protecting this list
@@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr &&
+(!rds_ibdev->has_fmr || prefer_frmr));
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-26 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-26 Thread Santosh Shilimkar
The series is generated against net-next but also applies cleanly
against Linus's tip. The diff-stat looks a bit scary since almost ~4K
lines of code are being removed.

Brief summary of the series:

- Drop the stale iWARP support:
	RDS iWARP support code has become stale and non-testable for
	some time. As discussed and agreed earlier on the list [1], I am
	dropping its support for good. If new iWARP user(s) show up in
	the future, the plan is to adapt the existing IB RDMA transport
	for the special RDMA READ sink case.
- RDS gets SO_TIMESTAMP support
- Long due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards new FastReg Memory registration (FRMR)
- Lastly the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have
some patches in progress to address that. But they are not ready for 4.6
so I left them out of this series. 

I am also keeping an eye on the new CQ API adaptations that other ULPs
are doing, and will try to adapt RDS to the same, most likely in the
4.7 timeframe.

Entire patchset is available below git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds

Feedback/comments welcome !!

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  51 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 492 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1068 insertions(+), 5033 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c


Regards,
Santosh

[1] http://www.spinics.net/lists/linux-rdma/msg30769.html

-- 
1.9.1



[net-next][PATCH 10/13] RDS: IB: add mr reused stats

2016-02-26 Thread Santosh Shilimkar
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 20ff191..00e9064 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1



[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-02-26 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method with which one can register memory
with the HCA. Some of the newer HCAs support only the fastreg MR mode,
so we need to add support for it to RDS to keep RDS functional on them.

Some of the older HCAs support both FMR and FRMR modes. To try out FRMR
on those HCAs, one can use the module parameter 'prefer_frmr'.
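
To make the flow concrete, below is a minimal sketch of a fastreg
registration written against the generic verbs API of this era; it is
not the RDS code from the diff, and the helper name is made up. The
key difference from FMR is that the mapping only becomes usable after
an IB_WR_REG_MR work request posted on the connection's QP completes,
which is why the ibmr now carries the connection (patch 08/13) and why
extra send queue/CQ space is reserved (patch 12/13).

#include <rdma/ib_verbs.h>

/* hypothetical helper: register an already DMA-mapped sg list */
static int frmr_sketch_register(struct ib_pd *pd, struct ib_qp *qp,
				struct scatterlist *sg, int sg_nents)
{
	struct ib_mr *mr;
	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *failed_wr;
	int n, ret;

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, sg_nents);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	/* build the MR's page list from the scatterlist */
	n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);
	if (n < sg_nents) {
		ib_dereg_mr(mr);
		return n < 0 ? n : -EINVAL;
	}

	/* the registration itself is a signaled work request on the QP */
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.wr.send_flags = IB_SEND_SIGNALED;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_READ |
			IB_ACCESS_REMOTE_WRITE;

	ret = ib_post_send(qp, &reg_wr.wr, &failed_wr);
	if (ret)
		ib_dereg_mr(mr);

	/* the MR is usable only after the REG_MR completion is reaped */
	return ret;
}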

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have
some patches in progress to address that. But they are not ready for 4.6
so I left them out of this series. 

 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..a86de13
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   retur

[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-02-26 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use of the
work request and completion queues for their operation. This patch
allocates extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1



[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr

2016-02-26 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info we can
retrieve the cm_id, which contains the QP handle needed for work
request posting.

We also need to drop the RDS connection on QP error states, which is
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1



[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file

2016-02-26 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 105 ++---
 3 files changed, 133 insertions(+), 104 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   

[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr

2016-02-26 Thread Santosh Shilimkar
Keep the FMR-related fields in their own struct. The fastreg MR
structure will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+

[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-02-26 Thread Santosh Shilimkar
This helps to combine the asynchronous fastreg MR completion handler
with the send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_r

Re: [net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-22 Thread santosh shilimkar

On 2/22/2016 7:38 AM, Bart Van Assche wrote:

On 02/21/16 19:36, David Miller wrote:

From: Santosh Shilimkar 
Date: Sat, 20 Feb 2016 03:30:02 -0800


@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
  MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per
HCA");
  module_param(rds_ib_retry_count, int, 0444);
  MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before
reporting an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and
FRMR supported");


Sorry, you're going to have to create a real run time method to configure
this parameter.

I'm strongly against module parameters.

Please don't go into details about why this might be difficult to do,
I'm totally not interested.  Doing things properly is sometimes not
easy, that's life.



Sure Dave. Will drop the parameter. The runtime detection is already
in place. When the HCA hardware supports both FMR and FRMR features,
the parameter can be used as an override of the default selection.


Hello Santosh,

What is the purpose of the prefer_frmr kernel module parameter ? Is this
a parameter that is useful to RDS users or is its only purpose to allow
developers of the RDS module to test both the FMR and FRMR code paths on
hardware that supports both MR methods ?


Right. Since FRMR is still in an early phase for RDS, it was useful on
HCAs which support both registration methods. It's not a deal breaker,
so I am going to drop the parameter as mentioned above.

Regards,
Santosh


[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-20 Thread Santosh Shilimkar
RDS iWARP support code has become stale and non-testable. As indicated
earlier, I am dropping support for it.

If new iWARP user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR for
the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &

[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-02-20 Thread Santosh Shilimkar
SO_TIMESTAMP generates a time stamp for each incoming RDS message. A
user application can enable it with a SO_TIMESTAMP setsockopt() at
SOL_SOCKET level. The CMSG data of cmsg type SCM_TIMESTAMP then carries
the time stamp in struct timeval format.
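
For illustration, here is a minimal user-space sketch of how an
application would consume this; it is not part of the patch, the
PF_RDS socket creation and bind are assumed to be done elsewhere, and
the helper name is made up:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

/* enable SO_TIMESTAMP and print the receive time of one message */
static void recv_one_with_tstamp(int fd)
{
	char payload[1024];
	char cbuf[CMSG_SPACE(sizeof(struct timeval))];
	struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cmsg;
	struct timeval tv;
	int on = 1;

	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

	if (recvmsg(fd, &msg, 0) < 0)
		return;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_TIMESTAMP) {
			memcpy(&tv, CMSG_DATA(cmsg), sizeof(tv));
			printf("received at %ld.%06ld\n",
			       (long)tv.tv_sec, (long)tv.tv_usec);
		}
	}
}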

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1



[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code

2016-02-20 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding fastreg
memory registration support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exi

[net-next][PATCH 03/13] MAINTAINERS: update RDS entry

2016-02-20 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 355e1c8..9d79bea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9081,10 +9081,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-02-20 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use of the
work request and completion queues for their operation. This patch
allocates extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1



[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-20 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-02-20 Thread Santosh Shilimkar
This helps to combine the asynchronous fastreg MR completion handler
with the send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_r

[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file

2016-02-20 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 105 ++---
 3 files changed, 133 insertions(+), 104 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   

[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-20 Thread Santosh Shilimkar
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might
support just FRMR, just FMR, or both FMR and FRMR. When both MR
types are supported, FMR is used by default. Using the module
parameter 'prefer_frmr', the user can choose the preferred MR
method for RDS. Of course the module parameter has no effect if
the HCA supports only FMR or only FRMR.

The default MR mode is kept as FMR, in line with what everyone
else is using. The default will be changed to FRMR once RDS
performance with FRMR is comparable with FMR; that work is in
progress.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
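
The MR-mode selection this patch adds in rds_ib_add_one() reduces to a
single boolean expression. Below is a minimal standalone sketch of that
selection, with the HCA capability bits and the module parameter
modeled as plain booleans (this is an illustration, not the kernel
code):

/* Userspace sketch of the use_fastreg decision. */
#include <stdbool.h>
#include <stdio.h>

static bool select_fastreg(bool has_fr, bool has_fmr, bool prefer_frmr)
{
	/* FRMR is chosen when it is the only option, or when explicitly preferred */
	return has_fr && (!has_fmr || prefer_frmr);
}

int main(void)
{
	/* HCA supporting both: FMR stays the default unless prefer_frmr is set */
	printf("both, default     : %s\n", select_fastreg(true, true, false) ? "FRMR" : "FMR");
	printf("both, prefer_frmr : %s\n", select_fastreg(true, true, true) ? "FRMR" : "FMR");
	/* FRMR-only HCA: the module parameter has no effect */
	printf("FRMR only         : %s\n", select_fastreg(true, false, false) ? "FRMR" : "FMR");
	return 0;
}
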
 net/rds/ib.c| 14 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..68c94b0 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -47,6 +47,7 @@
 unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
 unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
+bool prefer_frmr;
 
 module_param(rds_ib_mr_1m_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR supported");
 
 /*
  * we have a clumsy combination of RCU and a rwsem protecting this list
@@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr &&
+(!rds_ibdev->has_fmr || prefer_frmr));
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-02-20 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method with which one can register
memory with an HCA. Some of the newer HCAs support only the fastreg
MR mode, so we need to add support for it to RDS to keep RDS
functional on them.

Some of the older HCAs support both FMR and FRMR modes, so to try
out FRMR on those HCAs, one can use the module parameter
'prefer_frmr'.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
RDS IB RDMA performance with FRMR is not yet as good as with FMR, and
I do have some patches in progress to address that, but they are not
ready for 4.6 so I left them out of this series.
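
Since send wr_ids are now plain ring indexes, the poll_scq() hunk below
can tell send completions and MR completions apart purely by wr_id
range. The following standalone sketch models that classification, with
the ring size and the ACK wr_id replaced by local stand-ins (it is an
illustration of the dispatch rule, not the kernel code):

/* Userspace sketch of the wr_id based completion dispatch. */
#include <stdint.h>
#include <stdio.h>

#define SEND_RING_SIZE	1024ULL		/* stand-in for ic->i_send_ring.w_nr */
#define ACK_WR_ID	(~(uint64_t)0)	/* stand-in for RDS_IB_ACK_WR_ID */

static const char *classify(uint64_t wr_id)
{
	if (wr_id <= SEND_RING_SIZE || wr_id == ACK_WR_ID)
		return "send completion";
	return "MR completion";
}

int main(void)
{
	printf("wr_id 5      -> %s\n", classify(5));
	printf("wr_id ~0     -> %s\n", classify(ACK_WR_ID));
	printf("wr_id 0x8000 -> %s\n", classify(0x8000));
	return 0;
}
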

 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..a86de13
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   retur

[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr

2016-02-20 Thread Santosh Shilimkar
Keep the FMR-related fields in their own struct. The fastreg MR
structure will be added to the union in a later patch.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
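
The layout change is easiest to see in isolation: the FMR handle and
DMA page array move into struct rds_ib_fmr, which sits in a union
inside struct rds_ib_mr so that FRMR state can live alongside it later.
A simplified standalone model, with the kernel types replaced by
placeholders, looks like this:

/* Userspace model of the new union layout; types are placeholders. */
#include <stdint.h>
#include <stdio.h>

struct fake_ib_fmr { int handle; };	/* placeholder for struct ib_fmr */

struct model_ib_fmr {
	struct fake_ib_fmr *fmr;
	uint64_t *dma;
};

struct model_ib_mr {
	/* device, pool, sg list, ... elided */
	union {
		struct model_ib_fmr fmr;
		/* struct model_ib_frmr frmr;  -- added by a later patch */
	} u;
};

int main(void)
{
	struct fake_ib_fmr hw = { .handle = 42 };
	struct model_ib_mr ibmr = { .u.fmr = { .fmr = &hw, .dma = NULL } };
	struct model_ib_fmr *fmr = &ibmr.u.fmr;	/* mirrors fmr = &ibmr->u.fmr */

	printf("fmr handle: %d\n", fmr->fmr->handle);
	return 0;
}
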
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+

[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr

2016-02-20 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info we can
retrieve the cm_id, which holds the QP handle needed for work
request posting.

We also need to drop the RDS connection on QP error states, which is
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
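
The reason for carrying a connection pointer in the MR can be shown
with a toy model: the connection owns the QP that registration work
requests are posted on, and it is also the object that has to be torn
down when that QP reports an error. Everything below is a stand-in for
the real kernel objects, sketched from the commit message rather than
the RDS internals:

/* Toy model of the mr->ic back-pointer; all types are stand-ins. */
#include <stdbool.h>
#include <stdio.h>

struct model_qp   { bool in_error; };
struct model_conn { struct model_qp qp; bool dropped; };

struct model_mr {
	struct model_conn *ic;	/* mirrors the rds_ib_mr::ic member added here */
};

static int post_reg_wr(struct model_mr *mr)
{
	struct model_conn *conn = mr->ic;

	if (conn->qp.in_error) {
		conn->dropped = true;	/* drop the RDS connection on QP error */
		return -1;
	}
	/* a registration work request would be posted on conn->qp here */
	return 0;
}

int main(void)
{
	struct model_conn conn = { .qp = { .in_error = true } };
	struct model_mr mr = { .ic = &conn };

	printf("post: %d, dropped: %d\n", post_reg_wr(&mr), conn.dropped);
	return 0;
}
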
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1


