Re: [PATCH 1/1] ARM: keystone: defconfig: Fix USB configuration
Hi Arnd, Olof,

Can you please pick up the fix for 4.8-rcx? Roger reported that USB ports are broken on Keystone 2 boards since v4.8-rc1 because the USB_PHY config option got dropped.

On 8/17/2016 3:44 AM, Roger Quadros wrote:
> Simply enabling CONFIG_KEYSTONE_USB_PHY doesn't work anymore as it
> depends on CONFIG_NOP_USB_XCEIV. We need to enable that as well.
> This fixes USB on Keystone boards from v4.8-rc1 onwards.
>
> Signed-off-by: Roger Quadros
> ---

Acked-by: Santosh Shilimkar

>  arch/arm/configs/keystone_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm/configs/keystone_defconfig b/arch/arm/configs/keystone_defconfig
> index 71b42e6..78cd2f1 100644
> --- a/arch/arm/configs/keystone_defconfig
> +++ b/arch/arm/configs/keystone_defconfig
> @@ -161,6 +161,7 @@ CONFIG_USB_MON=y
>  CONFIG_USB_XHCI_HCD=y
>  CONFIG_USB_STORAGE=y
>  CONFIG_USB_DWC3=y
> +CONFIG_NOP_USB_XCEIV=y
>  CONFIG_KEYSTONE_USB_PHY=y
>  CONFIG_NEW_LEDS=y
>  CONFIG_LEDS_CLASS=y
Re: [PATCH 2/2] remoteproc: core: Rework obtaining a rproc from a DT phandle
+Suman,

On 8/10/2016 10:15 AM, Bjorn Andersson wrote:
> On Tue 19 Jul 08:49 PDT 2016, Lee Jones wrote:
>> In this patch we:
>>  - Use a subsystem generic phandle to obtain an rproc
>>    - We have to support TI's bespoke version for the time being
>>  - Convert the wkup_m3_ipc driver to the new API
>>  - Rename the call to be more like other, similar OF calls
>>  - Move the feature-not-enabled inline stub to the headers
>>  - Strip out duplicate code by calling into of_get_rproc_by_index()
>>
>> Signed-off-by: Lee Jones
>> ---
>>  drivers/remoteproc/remoteproc_core.c | 41
>>  drivers/soc/ti/wkup_m3_ipc.c         | 14 +++-
>>  include/linux/remoteproc.h           | 4 ++--
>>  3 files changed, 14 insertions(+), 45 deletions(-)
> [..]
>> diff --git a/drivers/soc/ti/wkup_m3_ipc.c b/drivers/soc/ti/wkup_m3_ipc.c
>> index 8823cc8..15481f3 100644
>> --- a/drivers/soc/ti/wkup_m3_ipc.c
>> +++ b/drivers/soc/ti/wkup_m3_ipc.c
>> @@ -385,7 +385,6 @@ static int wkup_m3_ipc_probe(struct platform_device *pdev)
>>  {
>>  	struct device *dev = &pdev->dev;
>>  	int irq, ret;
>> -	phandle rproc_phandle;
>>  	struct rproc *m3_rproc;
>>  	struct resource *res;
>>  	struct task_struct *task;
>> @@ -430,16 +429,9 @@ static int wkup_m3_ipc_probe(struct platform_device *pdev)
>>  		return PTR_ERR(m3_ipc->mbox);
>>  	}
>>
>> -	if (of_property_read_u32(dev->of_node, "ti,rproc", &rproc_phandle)) {
>> -		dev_err(&pdev->dev, "could not get rproc phandle\n");
>> -		ret = -ENODEV;
>> -		goto err_free_mbox;
>> -	}
>> -
>> -	m3_rproc = rproc_get_by_phandle(rproc_phandle);
>> -	if (!m3_rproc) {
>> -		dev_err(&pdev->dev, "could not get rproc handle\n");
>> -		ret = -EPROBE_DEFER;
>> +	m3_rproc = of_get_rproc_by_phandle(dev->of_node);
>> +	if (IS_ERR(m3_rproc)) {
>> +		ret = PTR_ERR(m3_rproc);
>>  		goto err_free_mbox;
>>  	}
>
> Santosh, do you have any objections to me merging this?

This looks OK to me, but I have not been merging the remoteproc code. Looping in Suman who, IIRC, was looking at it along with Ohad.

> of_get_rproc_by_phandle() will fall back and attempt to acquire the handle from ti,rproc if the generic "rprocs" property doesn't exist.
Suman,

Can you please check this series and see if you can line it up? I am not sure if Ohad is still maintaining it.

Regards,
Santosh
Re: [PATCH v2 0/3] Add DSP control nodes for K2G
On 8/9/2016 7:33 AM, Andrew F. Davis wrote:
> Hello all,
>
> This series adds the nodes needed to control the DSP available on this SoC. These are similar to the nodes already present on the other K2x SoCs.
>
> Thanks,
> Andrew
>
> Andrew F. Davis (3):
>   ARM: dts: keystone-k2g: Add device state controller node
>   ARM: dts: keystone-k2g: Add keystone IRQ controller node
>   ARM: dts: keystone-k2g: Add DSP GPIO controller node

The series looks good to me. I will add them to the next merge window queue...
Re: [PATCH] device probe: add self triggered delayed work request
On 8/8/2016 6:11 PM, Frank Rowand wrote:
> On 08/08/16 14:51, Qing Huang wrote:
>> On 08/08/2016 01:44 PM, Frank Rowand wrote:
>>> On 07/29/16 22:39, Qing Huang wrote:
>>>> In normal conditions, the device probe requests kept in the deferred queue would only be triggered for re-probing when another new device probe finishes successfully. This change sets up a delayed trigger work request if the current deferred probe being added is the only one in the queue. This delayed work request will try to reactivate any device from the deferred queue for re-probing later. By doing this, if the last device being probed in the system boot process has a deferred probe error, this particular device will still be able to be probed again.
>>>
>>> I am trying to understand the use case. Can you explain the scenario you are trying to fix? If I understand correctly, you expect that something will change such that a later probe attempt will succeed. How will that change occur, and why will the deferred probe list not be processed in this case? Why are you conditioning this on the deferred_probe_pending_list being empty?
>>>
>>> -Frank
>>
>> It turns out one corner case which we worried about has already been solved in the really_probe() function by comparing 'deferred_trigger_count' values. Another use case we are investigating now: when we probe a device, the main thread returns EPROBE_DEFER from the driver after we spawn a child thread to do the actual init work, so we can initialize multiple similar devices at the same time. After the child thread finishes its task, we can call driver_deferred_probe_trigger() directly from the child thread to re-probe the device (driver_deferred_probe_trigger() has to be exported, though). Or we could rely on something in this patch to re-probe the deferred devices from the pending list... What do you suggest?
>
> See commit 735a7ffb739b6efeaeb1e720306ba308eaaeb20e for how multi-threaded probes were intended to be handled.
> I don't know if this approach is used much or even usable, but that is the framework that was created.

That infrastructure got removed as part of the below commit :-(

  commit 5adc55da4a7758021bcc374904b0f8b076508a11
  Author: Adrian Bunk
  Date:   Tue Mar 27 03:02:51 2007 +0200

      PCI: remove the broken PCI_MULTITHREAD_PROBE option

      This patch removes the PCI_MULTITHREAD_PROBE option that had already been marked as broken.

      Signed-off-by: Adrian Bunk
      Signed-off-by: Greg Kroah-Hartman
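For reference, the mechanism proposed in Qing's patch can be sketched roughly as below. This is a kernel-API sketch, not standalone-compilable, and the work/list helper names here are illustrative rather than taken from the actual patch:

```c
/* Sketch (assumed names): arm a one-shot retry when the first device
 * lands on the deferred-probe list, so the last device to defer during
 * boot still gets another probe attempt later. */
static void deferred_retry_fn(struct work_struct *work)
{
	/* Re-run probing for everything on the deferred list. */
	driver_deferred_probe_trigger();
}

static DECLARE_DELAYED_WORK(deferred_retry_work, deferred_retry_fn);

static void deferred_probe_add_sketch(struct device *dev)
{
	bool was_empty = list_empty(&deferred_probe_pending_list);

	list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);

	/*
	 * If this device is the only deferred one, no later successful
	 * probe may ever re-trigger the list, so schedule a retry.
	 */
	if (was_empty)
		schedule_delayed_work(&deferred_retry_work, 5 * HZ);
}
```

The empty-list condition is the crux of Frank's question: it is the case in which no other probe completion will ever kick the deferred list again.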
Re: [RFC PATCH] softirq: fix tasklet_kill() usage and users
ping !!

On 8/1/2016 9:13 PM, Santosh Shilimkar wrote:
> Semantically, the expectation from the tasklet init/kill API should be as below:
>
>   tasklet_init() == Init and enable scheduling
>   tasklet_kill() == Disable scheduling and destroy
>
> tasklet_init() exhibits the above behavior, but tasklet_kill() does not: the tasklet handler can still get scheduled and run even after tasklet_kill(). There are a few places where drivers are working around this issue by calling tasklet_disable(), which adds a usecount and thereby avoids the handler being called. One example is commit 1e1257860fd1 ("tty/serial: at91: correct the usage of tasklet").
>
> tasklet_enable()/tasklet_disable() are a paired API and expected to be used together. Using tasklet_disable() *just* to work around tasklet scheduling after kill is probably not the correct and intended use of the API. We also happened to see a similar issue where, in a shutdown path, the tasklet handler was getting called even after tasklet_kill().
>
> We can fix this by making sure tasklet_kill() does the right thing, thereby ensuring the tasklet handler won't run after tasklet_kill(), with a very simple change. The patch fixes the tasklet code and also the few driver hacks that work around the issue.
>
> Cc: Greg Kroah-Hartman
> Cc: Andrew Morton
> Cc: Thomas Gleixner
> Cc: Tadeusz Struk
> Cc: Herbert Xu
> Cc: "David S. Miller"
> Cc: Paul Bolle
> Cc: Nicolas Ferre
> Signed-off-by: Santosh Shilimkar
> ---
>  drivers/crypto/qat/qat_common/adf_isr.c    | 1 -
>  drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
>  drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
>  drivers/isdn/gigaset/interface.c           | 1 -
>  drivers/tty/serial/atmel_serial.c          | 1 -
>  kernel/softirq.c                           | 7 ++++---
>  6 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/crypto/qat/qat_common/adf_isr.c b/drivers/crypto/qat/qat_common/adf_isr.c
> index 06d4901..fd5e900 100644
> --- a/drivers/crypto/qat/qat_common/adf_isr.c
> +++ b/drivers/crypto/qat/qat_common/adf_isr.c
> @@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
>  	int i;
>
>  	for (i = 0; i < hw_data->num_banks; i++) {
> -		tasklet_disable(&priv_data->banks[i].resp_handler);
>  		tasklet_kill(&priv_data->banks[i].resp_handler);
>  	}
>  }
> diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c b/drivers/crypto/qat/qat_common/adf_sriov.c
> index 4a526e2..9e65888 100644
> --- a/drivers/crypto/qat/qat_common/adf_sriov.c
> +++ b/drivers/crypto/qat/qat_common/adf_sriov.c
> @@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
>  	}
>
>  	for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
> -		tasklet_disable(&vf->vf2pf_bh_tasklet);
>  		tasklet_kill(&vf->vf2pf_bh_tasklet);
>  		mutex_destroy(&vf->pf2vf_lock);
>  	}
> diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c b/drivers/crypto/qat/qat_common/adf_vf_isr.c
> index aa689ca..81e63bf 100644
> --- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
> +++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
> @@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev *accel_dev)
>
>  static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
>  {
> -	tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
>  	tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
>  	mutex_destroy(&accel_dev->vf.vf2pf_lock);
>  }
> @@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
>  {
>  	struct adf_etr_data *priv_data = accel_dev->transport;
>
> -	tasklet_disable(&priv_data->banks[0].resp_handler);
>  	tasklet_kill(&priv_data->banks[0].resp_handler);
>  }
> diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
> index 600c79b..2ce63b6 100644
> --- a/drivers/isdn/gigaset/interface.c
> +++ b/drivers/isdn/gigaset/interface.c
> @@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
>  	if (!drv->have_tty)
>  		return;
>
> -	tasklet_disable(&cs->if_wake_tasklet);
>  	tasklet_kill(&cs->if_wake_tasklet);
>  	cs->tty_dev = NULL;
>  	tty_unregister_device(drv->tty, cs->minor_index);
> diff --git a/drivers/tty/serial/atmel_serial.c b/drivers/tty/serial/atmel_serial.c
> index 954941d..27e638e 100644
> --- a/drivers/tty/serial/atmel_serial.c
> +++ b/drivers/tty/serial/atmel_serial.c
> @@ -1915,7 +1915,6 @@ static void atmel_shutdown(struct uart_port *port)
>  	 * Clear out any scheduled tasklets before
>  	 * we destroy the buffers
>  	 */
> -	tasklet_disable(&atmel_port->tasklet);
>  	tasklet_kill(&atmel_port->tasklet);
>
>  	/*
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b..21397eb 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
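For context, the driver-side pattern the patch removes can be sketched as below. This is a kernel-API sketch, not standalone-compilable, and "struct example_dev" with its "bh" tasklet is a hypothetical name, not from any of the drivers touched:

```c
/* Hypothetical teardown path illustrating the workaround being removed. */
static void example_cleanup(struct example_dev *dev)
{
	/*
	 * Workaround style before this patch: drivers called
	 * tasklet_disable() first, because tasklet_kill() alone did not
	 * prevent the handler from being scheduled again afterwards.
	 *
	 *	tasklet_disable(&dev->bh);
	 *
	 * With the proposed semantics, tasklet_kill() both disables
	 * further scheduling and waits for a running handler to finish,
	 * so the extra tasklet_disable() is no longer needed.
	 */
	tasklet_kill(&dev->bh);
}
```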
Re: [PATCH 1/1] RDS: add __printf format attribute to error reporting functions
On 8/5/2016 1:11 PM, Nicolas Iooss wrote: This is helpful to detect at compile-time errors related to format strings. Signed-off-by: Nicolas Iooss --- OK. Acked-by: Santosh Shilimkar
[RFC PATCH] softirq: fix tasklet_kill() usage and users
Semantically, the expectation from the tasklet init/kill API should be as below:

  tasklet_init() == Init and enable scheduling
  tasklet_kill() == Disable scheduling and destroy

tasklet_init() exhibits the above behavior, but tasklet_kill() does not: the tasklet handler can still get scheduled and run even after tasklet_kill(). There are a few places where drivers are working around this issue by calling tasklet_disable(), which adds a usecount and thereby avoids the handler being called. One example is commit 1e1257860fd1 ("tty/serial: at91: correct the usage of tasklet").

tasklet_enable()/tasklet_disable() are a paired API and expected to be used together. Using tasklet_disable() *just* to work around tasklet scheduling after kill is probably not the correct and intended use of the API. We also happened to see a similar issue where, in a shutdown path, the tasklet handler was getting called even after tasklet_kill().

We can fix this by making sure tasklet_kill() does the right thing, thereby ensuring the tasklet handler won't run after tasklet_kill(), with a very simple change. The patch fixes the tasklet code and also the few driver hacks that work around the issue.

Cc: Greg Kroah-Hartman
Cc: Andrew Morton
Cc: Thomas Gleixner
Cc: Tadeusz Struk
Cc: Herbert Xu
Cc: "David S. Miller"
Cc: Paul Bolle
Cc: Nicolas Ferre
Signed-off-by: Santosh Shilimkar
---
 drivers/crypto/qat/qat_common/adf_isr.c    | 1 -
 drivers/crypto/qat/qat_common/adf_sriov.c  | 1 -
 drivers/crypto/qat/qat_common/adf_vf_isr.c | 2 --
 drivers/isdn/gigaset/interface.c           | 1 -
 drivers/tty/serial/atmel_serial.c          | 1 -
 kernel/softirq.c                           | 7 ++++---
 6 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/adf_isr.c b/drivers/crypto/qat/qat_common/adf_isr.c
index 06d4901..fd5e900 100644
--- a/drivers/crypto/qat/qat_common/adf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_isr.c
@@ -296,7 +296,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 	int i;

 	for (i = 0; i < hw_data->num_banks; i++) {
-		tasklet_disable(&priv_data->banks[i].resp_handler);
 		tasklet_kill(&priv_data->banks[i].resp_handler);
 	}
 }
diff --git a/drivers/crypto/qat/qat_common/adf_sriov.c b/drivers/crypto/qat/qat_common/adf_sriov.c
index 4a526e2..9e65888 100644
--- a/drivers/crypto/qat/qat_common/adf_sriov.c
+++ b/drivers/crypto/qat/qat_common/adf_sriov.c
@@ -204,7 +204,6 @@ void adf_disable_sriov(struct adf_accel_dev *accel_dev)
 	}

 	for (i = 0, vf = accel_dev->pf.vf_info; i < totalvfs; i++, vf++) {
-		tasklet_disable(&vf->vf2pf_bh_tasklet);
 		tasklet_kill(&vf->vf2pf_bh_tasklet);
 		mutex_destroy(&vf->pf2vf_lock);
 	}
diff --git a/drivers/crypto/qat/qat_common/adf_vf_isr.c b/drivers/crypto/qat/qat_common/adf_vf_isr.c
index aa689ca..81e63bf 100644
--- a/drivers/crypto/qat/qat_common/adf_vf_isr.c
+++ b/drivers/crypto/qat/qat_common/adf_vf_isr.c
@@ -191,7 +191,6 @@ static int adf_setup_pf2vf_bh(struct adf_accel_dev *accel_dev)

 static void adf_cleanup_pf2vf_bh(struct adf_accel_dev *accel_dev)
 {
-	tasklet_disable(&accel_dev->vf.pf2vf_bh_tasklet);
 	tasklet_kill(&accel_dev->vf.pf2vf_bh_tasklet);
 	mutex_destroy(&accel_dev->vf.vf2pf_lock);
 }
@@ -268,7 +267,6 @@ static void adf_cleanup_bh(struct adf_accel_dev *accel_dev)
 {
 	struct adf_etr_data *priv_data = accel_dev->transport;

-	tasklet_disable(&priv_data->banks[0].resp_handler);
 	tasklet_kill(&priv_data->banks[0].resp_handler);
 }
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index 600c79b..2ce63b6 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -524,7 +524,6 @@ void gigaset_if_free(struct cardstate *cs)
 	if (!drv->have_tty)
 		return;

-	tasklet_disable(&cs->if_wake_tasklet);
 	tasklet_kill(&cs->if_wake_tasklet);
 	cs->tty_dev = NULL;
 	tty_unregister_device(drv->tty, cs->minor_index);
diff --git a/drivers/tty/serial/atmel_serial.c b/drivers/tty/serial/atmel_serial.c
index 954941d..27e638e 100644
--- a/drivers/tty/serial/atmel_serial.c
+++ b/drivers/tty/serial/atmel_serial.c
@@ -1915,7 +1915,6 @@ static void atmel_shutdown(struct uart_port *port)
 	 * Clear out any scheduled tasklets before
 	 * we destroy the buffers
 	 */
-	tasklet_disable(&atmel_port->tasklet);
 	tasklet_kill(&atmel_port->tasklet);

 	/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..21397eb 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -498,7 +498,7 @@ static void tasklet_action(struct softirq_action *a)
 		list = list-&
Re: [PATCH V2 45/63] clocksource/drivers/timer-keystone: Convert init function to return error
On 6/16/2016 2:27 PM, Daniel Lezcano wrote:
> The init functions do not return any error. They behave as follows:
>
>  - panic, thus leading to a kernel crash while another timer may work and
>    make the system boot up correctly, or
>  - print an error and leave the caller unaware of the state of the system
>
> Change that by converting the init functions to return an error conforming to the CLOCKSOURCE_OF_RET prototype. Proper error handling (rollback, errno value) will be changed later case by case; thus this change just returns back an error or success in the init function.
>
> Signed-off-by: Daniel Lezcano
> ---
>  drivers/clocksource/timer-keystone.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)

Acked-by: Santosh Shilimkar
Re: [PATCH] ARM: keystone: remove redundant depends on ARM_PATCH_PHYS_VIRT
On 6/13/2016 11:17 PM, Masahiro Yamada wrote:
> Hi Santosh,
>
> Ping again. It is taking so long for this apparently correct patch.

I thought it was already picked up. Will apply it for the next merge window.

Regards,
Santosh
Re: [PATCH v2 0/3] ARM: Keystone: Add pinmuxing support
On 6/9/2016 8:26 AM, Franklin S Cooper Jr wrote: Unlike most Keystone 2 devices, K2G supports pinmuxing of its pins. This patch series enables pinmuxing for Keystone 2 devices. Version 2 changes: Rebased on top of linux-next which includes Keerthy patches. Series applied. Should start showing up in linux-next soon. Regards, Santosh
Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config
Franklin,

On 6/6/2016 9:00 AM, Santosh Shilimkar wrote:
> On 6/5/2016 9:56 PM, Keerthy wrote:
> [...]
>> Santosh, I posted a consolidated series for k2l.
>
> Thanks. Will pick that up.
>
> Franklin, could you re-post the k2g series on top of the series I posted today.

I have updated the keystone 4.8 branches and the linux-next branch. Please refresh your DTS patches against [1] and post the same.

Regards,
Santosh

[1] git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git for_4.8/keystone_dts
Re: [PATCH] RDS: IB: Remove deprecated create_workqueue
Hi,

On 6/7/2016 12:33 PM, Bhaktipriya Shridhar wrote:
> alloc_workqueue() replaces the deprecated create_workqueue(). Since the driver is InfiniBand, which can be used as a block device, and the workqueue seems involved in the regular operation of the device, a dedicated workqueue has been used with WQ_MEM_RECLAIM set to guarantee forward progress under memory pressure. Since there are only a fixed number of work items, an explicit concurrency limit is unnecessary here.
>
> Signed-off-by: Bhaktipriya Shridhar
> ---

Looks fine.

Acked-by: Santosh Shilimkar
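The conversion pattern described above looks roughly like this. This is a kernel-API sketch, not standalone-compilable, and the queue name and init function are illustrative, not necessarily what the actual RDS patch uses:

```c
/* Sketch (assumed names): create_workqueue() -> alloc_workqueue(). */
static struct workqueue_struct *rds_ib_wq;

static int example_wq_init(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the queue a rescuer thread, so queued work
	 * can make forward progress even under memory pressure.
	 * max_active = 0 picks the default concurrency limit, acceptable
	 * here since only a fixed number of work items exist.
	 */
	rds_ib_wq = alloc_workqueue("rds_ib_wq", WQ_MEM_RECLAIM, 0);
	if (!rds_ib_wq)
		return -ENOMEM;
	return 0;
}
```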
Re: [RFC v2 4/4] ARM: keystone: dma-coherent with safe fallback
(Joining the discussion late since only this thread showed up in my inbox)

On 6/6/2016 5:32 AM, Russell King - ARM Linux wrote:
> On Mon, Jun 06, 2016 at 12:59:18PM +0100, Mark Rutland wrote:
>> I agree that whether or not devices are coherent in practice depends on the kernel's configuration. The flip side, as you point out, is that devices are coherent when a specific set of attributes are used. i.e. that if you read dma-coherent as meaning "coherent iff Normal, Inner Shareable, Inner WB Cacheable, Outer WB Cacheable", then dma-coherent consistently describes the same thing, rather than depending on the configuration of the OS.

I think there is a bit of misunderstanding with the 'dma-coherent' DT property, and as RMK pointed out, "dma-coherent-outer" isn't the right direction either.

>> DT is a datastructure provided to the kernel, potentially without deep internal knowledge of that kernel configuration. Having a consistent rule that is independent of the kernel configuration seems worth aiming for.
>
> I think you've missed the point. dma-coherent is _already_ dependent on the kernel configuration. "Having a consistent rule that is independent of the kernel configuration" is already an impossibility, as I illustrated in my previous message concerning Marvell Armada SoCs, and you also said in your preceding paragraph! For example, if you clear the shared bit in the page tables on non-LPAE SoCs, devices are no longer coherent. DMA coherence on ARM _is_ already tightly linked with the kernel configuration. You already can't get away from that, so I think you should give up trying to argue that point. :)
>
> Whether devices are DMA coherent is a combination of two things:
>
>  * is the device connected to a coherent bus.
>  * is the system setup to allow coherency on that bus to work.
>
> We capture the first through the dma-coherent property, which is clearly a per-device property. We ignore the second because we assume everyone is going to configure the CPU side correctly.
>
> That's untrue today, and it's untrue not only because of Keystone II, but also because of other SoCs as well which pre-date Keystone II. We currently miss out on considering that, because if we ignore it, we get something that works for most platforms.

I agree with Russell. When I added the "dma-coherent" per-device DT property, the intention was to distinguish certain devices which may not be coherent, for some hardware reasons, while sitting on a coherent fabric.

> I don't see that adding a dma-outer-coherent property helps this - it's muddying the waters somewhat - and it's also forcing additional complexity into places where we shouldn't have it. We would need to parse two properties in the DMA API code, and then combine it with knowledge as to how the system page tables have been setup. If they've been setup as inner sharable, then dma-coherent identifies whether the device is coherent. If they've been setup as outer sharable, then dma-outer-coherent specifies that and dma-coherent is meaningless. Sounds like a recipe for confusion.

Exactly. We should leave the "dma-coherent" property to mark coherent vs non-coherent device(s). The inner vs outer is really a page-table ARCH setup issue and should be handled exactly the way it was done in the first place to handle the special memory view (outside 4 GB). Keystone needs the outer shared bit set while setting up the MMU pages, which is best done in MMU-off mode while recreating the new page tables.

Regards,
Santosh
Re: [RFC v2 4/4] ARM: keystone: dma-coherent with safe fallback
On 6/6/2016 5:50 AM, William Mills wrote: I saw only v2 but seems like it already generated discussion(s) On 06/06/2016 07:42 AM, Mark Rutland wrote: On Mon, Jun 06, 2016 at 11:09:07AM +0200, Arnd Bergmann wrote: On Monday, June 6, 2016 9:56:27 AM CEST Mark Rutland wrote: [adding devicetree] On Sun, Jun 05, 2016 at 11:20:29PM -0400, Bill Mills wrote: Keystone2 can do DMA coherency but only if: 1) DDR3A DMA buffers are in high physical addresses (0x8__) (DDR3B does not have this constraint) 2) Memory is marked outer shared 3) DMA Master marks transactions as outer shared (This is taken care of in bootloader) Use outer shared instead of inner shared. This choice is done at early init time and uses the attr_mod facility If the kernel is not configured for LPAE and using high PA, or if the switch to outer shared fails, then we fail to meet this criteria. Under any of these conditions we veto any dma-coherent attributes in the DTB. I very much do not like this. As I previously mentioned [1], dma-coherent has de-facto semantics today. This series deliberately changes that, and inverts the relationship between DT and kernel (as the describption in the DT would now depend on the configuration of the kernel). I would prefer that we have a separate property (e.g. "dma-outer-coherent") to describe when a device can be coherent with Normal, Outer Shareable, Inner Write-Back, Outer Write-Back memory. Then the kernel can figure out whether or not device can be used coherently, depending on how it is configured. I share your concern, but I don't think the dma-outer-coherent attribute would be a good solution either. The problem really is that keystone is a platform that is sometimes coherent, depending purely on what kernel we run, and not at all on anything we can describe in devicetree, and I don't see any good way to capture the behavior of the hardware in generic DT bindings. 
I think that the above doesn't quite capture the situation: Some DMA masters can be cache-coherent (only) with Outer Shareable transactions. That is a property we could capture in the DT (e.g. dma-outer-coherent), and it is independent of the kernel configuration. Whether or not the devices are coherent with the kernel's chosen memory attributes certainly depends on the kernel configuration, but that is not what we capture in the DT.

So far, the assumption has been:

 - when running a non-LPAE kernel, keystone is not coherent, and we must ignore both the dma-coherent properties in devices and the dma-ranges properties in bus nodes.

Correct.

I wasn't able to spot if/where that was enforced. Is it possible to boot Keystone UP, !LPAE?

Yes ... with the right combination of DTB, u-boot, u-boot vars, and kernel config. Mismatches either fail hard or use dma-coherent ops without actually providing coherency. I am attempting to make this less fragile. Mis-configured coherency can be dead wrong and still only fail 1 transaction in 1,000,000. I have seen customers run for weeks or months without detecting the issue. That's why I wanted the veto logic.

There are 3 cases to cover:

 - LPAE w/ high PA: this is the normal mode for KS2. Uses coherent dma-ops.
 - !LPAE: obviously uses low PA and must use non-coherent dma-ops.
 - LPAE w/ low PA: this happens with an LPAE kernel when the user has passed a low-PA memory DTB and u-boot has not fixed it up. This case must also use non-coherent dma-ops.

Upstream DTS has keystone memory at the low PA. I agree with that. U-boot and the kernel opt in to the use of high PA. If you give high PA to a non-LPAE kernel I believe it will fail hard and fast. I can check.

A UP kernel will mostly boot from the boot view of the memory. The keystone_pv_fixup() will bail out for a higher PA. Let me know if you see otherwise.

Regards,
Santosh
Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config
On 6/5/2016 9:56 PM, Keerthy wrote: [...] Santosh, I posted a consolidated series for k2l. Thanks. Will pick that up. Franklin, Could you re-post k2g series on top of the series i posted today. Please do.
Re: [PATCH 0/5] ARM:Keystone: Add pinmuxing support
Franklin, On 6/3/2016 11:42 AM, Franklin Cooper Jr. wrote: Gentle ping on this series On 04/27/2016 09:11 AM, Franklin S Cooper Jr wrote: Unlike most Keystone 2 devices, K2G supports pinmuxing of its pins. This patch series enables pinmuxing for Keystone 2 devices. Franklin S Cooper Jr (1): ARM: keystone: defconfig: Enable PINCTRL SINGLE for Keystone 2 Lokesh Vutla (3): ARM: Keystone: Enable PINCTRL for Keystone ARCH ARM: dts: keystone: Header file for pinctrl constants ARM: dts: k2g-evm: Add pinmuxing for UART0 Vitaly Andrianov (1): ARM: dts: k2g: Add pinctrl support Can you please check if it needs to be refreshed against v4.7-rc1 and if yes please re-post it. I will apply it for 4.8 Regards, Santosh
Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features
On 6/2/2016 5:34 AM, Nishanth Menon wrote:
> On 06/01/2016 06:26 PM, Santosh Shilimkar wrote:
> [...]
>>>> Side note on LPAE: For our current device tree and u-boot, LPAE is mandatory to bootup for current Keystone boards - but this is not a SoC requirement, booting without LPAE/HIGHMEM results in non-coherent DDR accesses.
>>>
>>> This sounds like a regression, I thought we had this working when keystone was initially merged and we got both the coherent and non-coherent mode working with the same DT.
>>
>> Yes and it works. The coherent memory space itself is beyond 4GB so I don't understand a requirement of having coherent memory without LPAE.
>
> Hmm... True, I just tested next-20160602 with mem_lpae set to 0 in u-boot and it seems to boot just fine. Looks like a messed-up description on my end. Looks like I have to update my automated test framework to incorporate the manual steps involved.

No worries. Am glad you got your setup working.

Regards,
Santosh
Re: [PATCH v1 1/2] ARM: dts: keystone: remove bogus IO resource entry from PCI binding
On 6/2/2016 8:17 AM, Murali Karicheri wrote:
> The PCI DT bindings contain a bogus entry for IO space which is not supported on Keystone. The current bogus entry has an invalid size and throws the following error during boot:
>
>   [0.420713] keystone-pcie 21021000.pcie: error -22: failed to map resource [io 0x-0x40003fff]
>
> So remove it from the dts. While at it, also add a bus-range value that eliminates the following log at boot up:
>
>   [0.420659] No bus range found for /soc/pcie@2102, using [bus 00-ff]
>
> Signed-off-by: Murali Karicheri
> ---

Both 1/2 and 2/2 look fine to me. Will queue them for the next merge window.

Regards,
Santosh
Re: [PATCH] rds: fix an infoleak in rds_inc_info_copy
On 6/2/2016 1:11 AM, Kangjie Lu wrote: The last field "flags" of object "minfo" is not initialized. Copying this object out may leak kernel stack data. Assign 0 to it to avoid leak. Signed-off-by: Kangjie Lu --- net/rds/recv.c | 2 ++ 1 file changed, 2 insertions(+) Acked-by: Santosh Shilimkar
Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features
On 6/1/2016 3:49 PM, Nishanth Menon wrote:
> On 06/01/2016 05:31 PM, Arnd Bergmann wrote:
> [...]
> Santosh, Bill, Lokesh, Grygorii: could you help feedback on the above comments from Arnd?

I already responded to Arnd's email.
Re: [PATCH] ARM: Keystone: Introduce Kconfig option to compile in typical Keystone features
On 6/1/2016 3:31 PM, Arnd Bergmann wrote:
> On Wednesday, June 1, 2016 4:31:54 PM CEST Nishanth Menon wrote:
>> Introduce ARCH_KEYSTONE_TYPICAL which is common for all Keystone platforms. This is particularly useful when custom optimized defconfig builds are created for Keystone architecture platforms. An example of the same would be a sample fragment ks_only.cfg: http://pastebin.ubuntu.com/16904991/ - This prunes all arches other than keystone and any options the other architectures may enable.
>>
>>   git clean -fdx && git reset --hard && \
>>   ./scripts/kconfig/merge_config.sh -m \
>>   ./arch/arm/configs/multi_v7_defconfig ~/tmp/ks_only.cfg && \
>>   make olddefconfig
>>
>> The above unfortunately will disable options necessary for KS2 boards to boot to the bare minimum initramfs. Hence the "KEYSTONE_TYPICAL" option is designed similar to commit 8d9166b519fd ("omap2/3/4: Add Kconfig option to compile in typical omap features") that can be enabled for most Keystone platform boards without needing to rediscover these options in the defconfig all over again - examples include the multi_v7_defconfig base and optimizations done on top of it for the keystone platform.
>
> I'd rather remove the option for OMAP as well, it doesn't really fit in with how we do things for other platforms, and selecting a lot of other Kconfig symbols tends to cause circular dependencies.

Yes.

>> NOTE: the alternative is to select the configurations under ARCH_KEYSTONE. However, that would fail multi_v7 builds on ARM variants that don't work with LPAE.
>
> Please no arbitrary selects from the platform.
Cc: Bill Mills
Cc: Murali Karicheri
Cc: Grygorii Strashko
Cc: Tero Kristo
Cc: Lokesh Vutla
Signed-off-by: Nishanth Menon
---
Based on: next-20160601
Tested for basic initramfs boot for K2HK/K2G platforms with the http://pastebin.ubuntu.com/16904991/ fragment + multi_v7_defconfig

Side note on LPAE: For our current device tree and u-boot, LPAE is mandatory to bootup for current Keystone boards - but this is not a SoC requirement, booting without LPAE/HIGHMEM results in non-coherent DDR accesses.

This sounds like a regression, I thought we had this working when keystone was initially merged and we got both the coherent and non-coherent mode working with the same DT.

Yes and it works. The coherent memory space itself is beyond 4GB so I don't understand a requirement of having coherent memory without LPAE.

Currently:

- U-Boot assumes that lpae is always enabled in the kernel and updates the DT memory node with higher addresses.

Because of which you are not detecting any memory without lpae and the kernel crashed very early, hence no prints. So, set the mem_lpae env setting to 0 in U-boot.

We could work around this in the kernel by detecting the faulty u-boot behavior and fixing up the addresses in an early platform callback.

U-boot is already doing that and I don't see any issue with it.

- DT also assumes that lpae is always enabled, and always asks for dma-address translation from higher addresses to lower addresses.

Just delete the "dma-ranges" property or create a one-to-one mapping like dma-ranges = <0x8000 0x0 0x8000 0x8000>

This may be a bit trickier, I think originally keystone ignored the dma-ranges property and hacked up its own offset by adding a magic constant to the dma address using a bus notifier. We probably don't want to bring that hack back, but maybe we can come up with another solution.

I don't think we should go on this path ever. U-boot should modify this parameter as done previously.

Regards,
Santosh
Re: [PATCH 3/3] ARM: configs: keystone: Enable PINCTRL_SINGLE Config
Hi Keerthy, On 5/23/2016 8:56 PM, Keerthy wrote: On Tuesday 24 May 2016 09:07 AM, Lokesh Vutla wrote: On Monday 23 May 2016 05:59 PM, Keerthy wrote: keystone-k2l devices use pinmux and are compliant with PINCTRL_SINGLE. Hence enable the config option. Signed-off-by: Keerthy A similar patch[1] has already been posted. [1]https://patchwork.kernel.org/patch/8958091/ Ah, I had not seen them. If they are already reviewed and closer to being merged, then patch 2 and patch 3 of this series can be dropped. Once 4.7-rc2 is out, please rebase these floating patches against it and post a consolidated series. I will line them up for 4.8. Regards, Santosh
Re: [PATCH 46/54] MAINTAINERS: Add file patterns for ti device tree bindings
On 5/22/2016 2:06 AM, Geert Uytterhoeven wrote: Submitters of device tree binding documentation may forget to CC the subsystem maintainer if this is missing. Signed-off-by: Geert Uytterhoeven Cc: Santosh Shilimkar Cc: linux-kernel@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org --- Please apply this patch directly if you want to be involved in device tree binding documentation for your subsystem. --- Acked-by: Santosh Shilimkar
Re: [rcu_sched stall] regression/miss-config ?
Hi Paul, On 5/17/2016 12:15 PM, Paul E. McKenney wrote: On Tue, May 17, 2016 at 06:46:22AM -0700, santosh.shilim...@oracle.com wrote: On 5/16/16 5:58 PM, Paul E. McKenney wrote: On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote: On 5/16/2016 10:34 AM, Paul E. McKenney wrote: On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote: [...] Are you running CONFIG_NO_HZ_FULL=y? If so, the problem might be that you need more housekeeping CPUs than you currently have configured. Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for housekeeping? It seems that without that, the clock-event code will just use CPU0 for things like broadcasting, which might become a bottleneck. This could explain the hrtimer_interrupt() path getting slowed down because of the housekeeping bottleneck. $cat .config | grep NO_HZ CONFIG_NO_HZ_COMMON=y # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y # CONFIG_NO_HZ_FULL_ALL is not set # CONFIG_NO_HZ_FULL_SYSIDLE is not set CONFIG_NO_HZ=y # CONFIG_RCU_FAST_NO_HZ is not set Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all housekeeping tasks, including the RCU grace-period kthreads. So you are booting without any nohz_full boot parameter? You can end up with the same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter that you can with CONFIG_NO_HZ_FULL_ALL=y. I see. Yes, the systems are booting without the nohz_full boot parameter. Will try to add more CPUs to it and update the thread after the verification, since it takes time to reproduce the issue. Thanks for the discussion so far, Paul. It's very insightful for me. Please let me know how things go with further testing, especially with the priority setting. Sorry for the delay. I managed to get information about the XEN use case's custom config as discussed above. To reduce variables, I disabled "CONFIG_NO_HZ_FULL" altogether. 
So the effective setting was: CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_TREE_RCU_TRACE=y CONFIG_RCU_KTHREAD_PRIO=1 CONFIG_RCU_CPU_STALL_TIMEOUT=21 CONFIG_RCU_TRACE=y Unfortunately the XEN test still failed. Log at the end of the email. This test is a bit peculiar though, since it's a database running in a VM with 1 or 2 CPUs. One suspect is that because the database RT processes are hogging the CPU(s), the kernel RCU thread is not getting a chance to run, which eventually results in the stall. Does that make sense? Please note that it's a non-preempt kernel running RT processes. ;-) # cat .config | grep PREEMPT CONFIG_PREEMPT_NOTIFIERS=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set Regards, Santosh ... rcu_sched kthread starved for 399032 jiffies! INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=11, c=118887, q=0) All QSes seen, last rcu_sched kthread activity 462037 (4296277632-4295815595), jiffies_till_next_fqs=3, root ->qsmask 0x0 ocssd.bin R running task0 15375 1 0x 8800ec003bc8 810a8581 81abf980 0001d068 8800ec003c28 810e9c98 0086 0086 0082 Call Trace: [] sched_show_task+0xb1/0x120 [] print_other_cpu_stall+0x288/0x2d0 [] __rcu_pending+0x180/0x230 [] rcu_check_callbacks+0x95/0x140 [] update_process_times+0x42/0x70 [] tick_sched_handle+0x39/0x80 [] tick_sched_timer+0x52/0xa0 [] __run_hrtimer+0x74/0x1d0 [] ? tick_nohz_handler+0xc0/0xc0 [] hrtimer_interrupt+0x102/0x240 [] xen_timer_interrupt+0x2e/0x130 [] ? add_interrupt_randomness+0x3a/0x1f0 [] ? store_cursor_blink+0xc0/0xc0 [] handle_irq_event_percpu+0x54/0x1b0 [] handle_percpu_irq+0x47/0x70 [] generic_handle_irq+0x27/0x40 [] evtchn_2l_handle_events+0x25a/0x260 [] ? __do_softirq+0x191/0x2f0 [] __xen_evtchn_do_upcall+0x4f/0x90 [] xen_evtchn_do_upcall+0x34/0x50 [] xen_hvm_callback_vector+0x6e/0x80 rcu_sched kthread starved for 462037 jiffies!
Re: [rcu_sched stall] regression/miss-config ?
On 5/16/2016 10:34 AM, Paul E. McKenney wrote: On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote: On 5/16/2016 5:03 AM, Paul E. McKenney wrote: On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilim...@oracle.com wrote: On 5/15/16 2:18 PM, Santosh Shilimkar wrote: Hi Paul, I was asking Sasha about [1] since other folks in Oracle also stumbled upon similar RCU stalls with the v4.1 kernel in different workloads. A similar issue was reported to me with RDS as well, and looking at [1], [2], [3] and [4], I thought of reaching out to see if you can help us understand this issue better. I have also included the RCU-specific config used in these tests. It's very hard to reproduce the issue, but one data point is that it reproduces on systems with larger CPU counts (64+). The same workload with fewer than 64 CPUs doesn't show the issue. Someone also told me that using the SLAB allocator instead of SLUB makes a difference, but I haven't verified that part for RDS. Let me know your thoughts. Thanks in advance !! One of my colleagues told me the pastebin server I used is Oracle-internal only, so I'm adding the relevant logs along with the email. [...] [1] https://lkml.org/lkml/2014/12/14/304 [2] Log 1 snippet: - INFO: rcu_sched self-detected stall on CPU INFO: rcu_sched self-detected stall on CPU { 54} (t=6 jiffies g=66023 c=66022 q=0) Task dump for CPU 54: ksoftirqd/54R running task0 389 2 0x0008 0007 88ff7f403d38 810a8621 0036 81ab6540 88ff7f403d58 810a86cf 0086 81ab6940 88ff7f403d88 810e3ad3 81ab6540 Call Trace: [] sched_show_task+0xb1/0x120 [] dump_cpu_task+0x3f/0x50 [] rcu_dump_cpu_stacks+0x83/0xc0 [] print_cpu_stall+0xfc/0x170 [] __rcu_pending+0x2bb/0x2c0 [] rcu_check_callbacks+0x9d/0x170 [] update_process_times+0x42/0x70 [] tick_sched_handle+0x39/0x80 [] tick_sched_timer+0x44/0x80 [] __run_hrtimer+0x74/0x1d0 [] ? tick_nohz_handler+0xa0/0xa0 [] hrtimer_interrupt+0x102/0x240 [] local_apic_timer_interrupt+0x39/0x60 [] smp_apic_timer_interrupt+0x45/0x59 [] apic_timer_interrupt+0x6e/0x80 [] ? 
free_one_page+0x164/0x380 [] ? __free_pages_ok+0xc3/0xe0 [] __free_pages+0x25/0x40 [] rds_message_purge+0x60/0x150 [rds] [] rds_message_put+0x44/0x80 [rds] [] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma] [] ? _raw_spin_unlock_irqrestore+0x1b/0x50 [] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib] [] poll_cq+0xa1/0xe0 [rds_rdma] [] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma] The most likely possibility is that there is a 60-second-long loop in one of the above functions. This is within bottom-half execution, so unfortunately the usual trick of placing cond_resched_rcu_qs() within this loop, but outside of any RCU read-side critical section, does not work. First of all, thanks for the explanation. There is no loop which can last for 60 seconds in the above code, since it's just a completion queue handler used to free up buffers, much like a NIC driver's bottom half (NAPI). It's done in tasklet context for latency reasons, which RDS cares about most. Just to get your attention, the RCU stall is also seen with XEN code. The log for it is at the end of the email. Another important observation is that for RDS, if we avoid higher-order page allocations, the issue is not reproducible so far. In other words, for PAGE_SIZE (4K, get_order(bytes) == 0) allocations, the system continues to run without any issue, so the loop scenario is ruled out more or less. To be specific, with PAGE_SIZE allocations, alloc_pages() is just allocating a page and __free_page() is used instead of __free_pages() from the below snippet. -- if (bytes >= PAGE_SIZE) page = alloc_pages(gfp, get_order(bytes)); . (rm->data.op_sg[i].length <= PAGE_SIZE) ? __free_page(sg_page(&rm->data.op_sg[i])) : __free_pages(sg_page(&rm->data.op_sg[i]), get_order(rm->data.op_sg[i].length)); This sounds like something to take up with the mm folks. Sure. Will do once the link between the two issues is established. Therefore, if there really is a loop here, one fix would be to periodically unwind back out to run_ksoftirqd(), but setting up so that the work would be continued later. 
Another fix might be to move this from tasklet context to workqueue context, where cond_resched_rcu_qs() can be used -- however, this looks a bit like networking code, which does not always take kindly to being run in process context (though careful use of local_bh_disable() and local_bh_enable() can sometimes overcome this issue). A third fix, which works only if this code does not use RCU and does not invoke any code that does use RCU, is to tell RCU that it should ignore this code (which will require a little work on RCU, as it currently does not tolerate this sort of thing aside from the idle threads). In this last approach, event-tracing calls must use the _nonidle suffix.
Re: [rcu_sched stall] regression/miss-config ?
On 5/16/2016 5:03 AM, Paul E. McKenney wrote: On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilim...@oracle.com wrote: On 5/15/16 2:18 PM, Santosh Shilimkar wrote: Hi Paul, I was asking Sasha about [1] since other folks in Oracle also stumbled upon similar RCU stalls with the v4.1 kernel in different workloads. A similar issue was reported to me with RDS as well, and looking at [1], [2], [3] and [4], I thought of reaching out to see if you can help us understand this issue better. I have also included the RCU-specific config used in these tests. It's very hard to reproduce the issue, but one data point is that it reproduces on systems with larger CPU counts (64+). The same workload with fewer than 64 CPUs doesn't show the issue. Someone also told me that using the SLAB allocator instead of SLUB makes a difference, but I haven't verified that part for RDS. Let me know your thoughts. Thanks in advance !! One of my colleagues told me the pastebin server I used is Oracle-internal only, so I'm adding the relevant logs along with the email. [...] [1] https://lkml.org/lkml/2014/12/14/304 [2] Log 1 snippet: - INFO: rcu_sched self-detected stall on CPU INFO: rcu_sched self-detected stall on CPU { 54} (t=6 jiffies g=66023 c=66022 q=0) Task dump for CPU 54: ksoftirqd/54R running task0 389 2 0x0008 0007 88ff7f403d38 810a8621 0036 81ab6540 88ff7f403d58 810a86cf 0086 81ab6940 88ff7f403d88 810e3ad3 81ab6540 Call Trace: [] sched_show_task+0xb1/0x120 [] dump_cpu_task+0x3f/0x50 [] rcu_dump_cpu_stacks+0x83/0xc0 [] print_cpu_stall+0xfc/0x170 [] __rcu_pending+0x2bb/0x2c0 [] rcu_check_callbacks+0x9d/0x170 [] update_process_times+0x42/0x70 [] tick_sched_handle+0x39/0x80 [] tick_sched_timer+0x44/0x80 [] __run_hrtimer+0x74/0x1d0 [] ? tick_nohz_handler+0xa0/0xa0 [] hrtimer_interrupt+0x102/0x240 [] local_apic_timer_interrupt+0x39/0x60 [] smp_apic_timer_interrupt+0x45/0x59 [] apic_timer_interrupt+0x6e/0x80 [] ? free_one_page+0x164/0x380 [] ? 
__free_pages_ok+0xc3/0xe0 [] __free_pages+0x25/0x40 [] rds_message_purge+0x60/0x150 [rds] [] rds_message_put+0x44/0x80 [rds] [] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma] [] ? _raw_spin_unlock_irqrestore+0x1b/0x50 [] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib] [] poll_cq+0xa1/0xe0 [rds_rdma] [] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma] The most likely possibility is that there is a 60-second-long loop in one of the above functions. This is within bottom-half execution, so unfortunately the usual trick of placing cond_resched_rcu_qs() within this loop, but outside of any RCU read-side critical section, does not work. First of all, thanks for the explanation. There is no loop which can last for 60 seconds in the above code, since it's just a completion queue handler used to free up buffers, much like a NIC driver's bottom half (NAPI). It's done in tasklet context for latency reasons, which RDS cares about most. Just to get your attention, the RCU stall is also seen with XEN code. The log for it is at the end of the email. Another important observation is that for RDS, if we avoid higher-order page allocations, the issue is not reproducible so far. In other words, for PAGE_SIZE (4K, get_order(bytes) == 0) allocations, the system continues to run without any issue, so the loop scenario is ruled out more or less. To be specific, with PAGE_SIZE allocations, alloc_pages() is just allocating a page and __free_page() is used instead of __free_pages() from the below snippet. -- if (bytes >= PAGE_SIZE) page = alloc_pages(gfp, get_order(bytes)); . (rm->data.op_sg[i].length <= PAGE_SIZE) ? __free_page(sg_page(&rm->data.op_sg[i])) : __free_pages(sg_page(&rm->data.op_sg[i]), get_order(rm->data.op_sg[i].length)); Therefore, if there really is a loop here, one fix would be to periodically unwind back out to run_ksoftirqd(), but setting up so that the work would be continued later. 
Another fix might be to move this from tasklet context to workqueue context, where cond_resched_rcu_qs() can be used -- however, this looks a bit like networking code, which does not always take kindly to being run in process context (though careful use of local_bh_disable() and local_bh_enable() can sometimes overcome this issue). A third fix, which works only if this code does not use RCU and does not invoke any code that does use RCU, is to tell RCU that it should ignore this code (which will require a little work on RCU, as it currently does not tolerate this sort of thing aside from the idle threads). In this last approach, event-tracing calls must use the _nonidle suffix. I am not familiar with the RDS code, so I cannot be more specific. No worries. Since we saw the issue with XEN too, I was suspecting that somehow we didn't hav
[rcu_sched stall] regression/miss-config ?
Hi Paul, I was asking Sasha about [1] since other folks in Oracle also stumbled upon similar RCU stalls with the v4.1 kernel in different workloads. A similar issue was reported to me with RDS as well, and looking at [1], [2], [3] and [4], I thought of reaching out to see if you can help us understand this issue better. I have also included the RCU-specific config used in these tests. It's very hard to reproduce the issue, but one data point is that it reproduces on systems with larger CPU counts (64+). The same workload with fewer than 64 CPUs doesn't show the issue. Someone also told me that using the SLAB allocator instead of SLUB makes a difference, but I haven't verified that part for RDS. Let me know your thoughts. Thanks in advance !! Regards, Santosh [1] https://lkml.org/lkml/2014/12/14/304 [2] log 1: http://pastebin.uk.oracle.com/iUr9qE [3] log 2: http://pastebin.uk.oracle.com/Oe3cr5 [4] log 3: http://pastebin.uk.oracle.com/bMYLkD [5] rcu config: http://pastebin.uk.oracle.com/e7NXTW
Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode
On 4/18/2016 4:36 PM, Tony Lindgren wrote: * Grygorii Strashko [160418 08:59]: On 04/15/2016 09:54 PM, Tony Lindgren wrote: * santosh shilimkar [160415 08:22]: On 4/15/2016 2:26 AM, Grygorii Strashko wrote: Santosh, Tony, do you want me to perform any additional actions regarding this patch? This patch should be run across the family of SoCs to make sure wakeup works on all of those, if not done already. Also, I'm not sure if we can just drop this code in question. After this patch, what function updates the GPIO wkup_en registers depending on enable_irq_wake()/disable_irq_wake()? The main purpose of this patch is to *not* modify the GPIO wkup_en registers depending on enable_irq_wake()/disable_irq_wake() :), instead all non wake up IRQs should be masked during suspend. OK that makes sense. The GPIO wkup_en registers should always be in sync with GPIO irq_en when the GPIO IP is in smart-idle wakeup mode. And this is done now from omap_gpio_unmask_irq()/omap_gpio_mask_irq(). See also [1]. In general, it is more or less similar to GIC + wakeupgen: - during normal work (including cpuidle) GIC irq_en and Wakeupgen wkup_en should always be in sync - during suspend - only IRQs marked as wake up sources should be left unmasked. Also, I've found an old thread [2] where Santosh proposed to use IRQCHIP_MASK_ON_SUSPEND. And it was not possible at that time, but now IRQCHIP_MASK_ON_SUSPEND can be used :), because the OMAP GPIO driver was switched to use a generic irq handler instead of a chained one, so now OMAP GPIO irqs are properly handled by the IRQ PM core. [chained irqs (and chained irq handlers) are not disabled during suspend/resume and they are not maintained by the IRQ PM core; as a result they can trigger way too early on resume when the OMAP GPIO is not ready/powered.] OK. For my tests this patch does not change anything. I noticed however that we still have some additional bug somewhere where GPIO wake up events work fine for omap3 PM runtime, but are flakey for suspend. 
I've tested it on: am57x-evm, am437x-idk-evm, omap4-panda OK thanks! Based on my tests and the above: Acked-by: Tony Lindgren If all works then consider my ack as well :-)
Re: [net][PATCH v2 0/2] RDS: couple of fixes for 4.6
On 4/16/2016 3:53 PM, David Miller wrote: From: Santosh Shilimkar Date: Thu, 14 Apr 2016 10:43:25 -0700 git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net/rds-fixes I have no idea how you set this up, but there is no WAY this can be pulled from by me. I thought I had based it against 'net' after your last comment. Just checked again, and the 'net' remote added by me points to the wrong URL (net-next). When I try to pull it into 'net' I get 2690 objects. That means you didn't base it upon the 'net' tree which you must do. You can't base it upon Linus's tree, because if you do I'll get a ton of changes that are absolutely not appropriate to be pulled into my 'net' tree. Are you always doing this? Working against Linus's tree instead of mine? No, it's not Linus's tree. It's yours, but not the right one. Sorry for the trouble. It won't happen again. Thanks for picking up the patches from patchwork. Regards, Santosh
Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode
On 4/15/2016 2:26 AM, Grygorii Strashko wrote: On 04/15/2016 11:32 AM, Linus Walleij wrote: On Tue, Apr 12, 2016 at 12:52 PM, Grygorii Strashko wrote: Now GPIO IRQ loss is observed on dra7-evm after suspend/resume cycle (...) Cc: Roger Quadros Signed-off-by: Grygorii Strashko Can I get some explicit ACK / Tested-by tags for this patch? Roger's promised to test it once the suspend regression is fixed for dra7-evm, probably next rc. Is it a serious regression that will need to go in as a fix and tagged for stable? This issue has been here since 2012, so I think it's not very critical - it seems the bit combination causing the issue is rare. Regarding stable: 4.4 - good to have, simple merge conflict 4.1 - some merge resolution is required older kernels - it will be hard to backport due to significant changes in the omap gpio driver Santosh, Tony, do you want me to perform any additional actions regarding this patch? This patch should be run across the family of SoCs to make sure wakeup works on all of those, if not done already
[net][PATCH v2 2/2] RDS: Fix the atomicity for congestion map update
Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomics to avoid such an issue. Full credit to Wengang for finding the issue, analysing it and also pointing out the offending code with a spin-lock-based fix. Reviewed-by: Leon Romanovsky Signed-off-by: Wengang Wang Signed-off-by: Santosh Shilimkar --- net/rds/cong.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/cong.c b/net/rds/cong.c index e6144b8..6641bcf 100644 --- a/net/rds/cong.c +++ b/net/rds/cong.c @@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __set_bit_le(off, (void *)map->m_page_addrs[i]); + set_bit_le(off, (void *)map->m_page_addrs[i]); } void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) @@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __clear_bit_le(off, (void *)map->m_page_addrs[i]); + clear_bit_le(off, (void *)map->m_page_addrs[i]); } static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) -- 1.9.1
[net][PATCH v2 0/2] RDS: couple of fixes for 4.6
v2: Rebased fixes against 'net' instead of 'net-next' Patches are also available at below git tree. The following changes since commit e013b7780c41b471c4269ac9ccafb65ba7c9ec86: Merge branch 'dsa-voidify-ops' (2016-04-08 16:51:15 -0400) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net/rds-fixes for you to fetch changes up to e9155afb1902380938ca83ba8504aaa2d7ee5210: RDS: Fix the atomicity for congestion map update (2016-04-08 15:08:13 -0700) Qing Huang (1): RDS: fix endianness for dp_ack_seq Santosh Shilimkar (1): RDS: Fix the atomicity for congestion map update net/rds/cong.c | 4 ++-- net/rds/ib_cm.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) Regards, Santosh
[net][PATCH v2 1/2] RDS: fix endianness for dp_ack_seq
From: Qing Huang dp->dp_ack_seq is used in big endian format. We need to do the big endianness conversion when we assign a value in host format to it. Signed-off-by: Qing Huang Signed-off-by: Santosh Shilimkar --- net/rds/ib_cm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 8764970..310cabc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn, dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version); dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version); dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); - dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic)); /* Advertise flow control */ if (ic->i_flowctl) { -- 1.9.1
Re: [PATCH] ARM: dts: keystone: Add aliases for SPI nodes
On 4/13/2016 3:52 AM, Vignesh R wrote: Add aliases for SPI nodes, this is required to probe the SPI devices in U-Boot. Signed-off-by: Vignesh R --- Applied. Thanks !!
Re: [PATCH] gpio: omap: fix irq triggering in smart-idle wakeup mode
On 4/12/2016 3:52 AM, Grygorii Strashko wrote: Now GPIO IRQ loss is observed on dra7-evm after a suspend/resume cycle in the following case: extcon_usb1(id_irq) -> pcf8575.gpio1 -> omapgpio6.gpio11 -> gic the extcon_usb1 is a wake up source and it enables IRQ wake up for id_irq by calling enable/disable_irq_wake() during suspend/resume which, in turn, causes execution of omap_gpio_wake_enable(). And omap_gpio_wake_enable() will set/clear the corresponding bit in the GPIO_IRQWAKEN_x register. omapgpio6 configuration after boot - wakeup is enabled for GPIO IRQs by default from omap_gpio_irq_type: GPIO_IRQSTATUS_SET_0| 0x0400 GPIO_IRQSTATUS_CLR_0| 0x0400 GPIO_IRQWAKEN_0 | 0x0400 GPIO_RISINGDETECT | 0x GPIO_FALLINGDETECT | 0x0400 omapgpio6 configuration after a suspend/resume cycle: GPIO_IRQSTATUS_SET_0| 0x0400 GPIO_IRQSTATUS_CLR_0| 0x0400 GPIO_IRQWAKEN_0 | 0x <--- GPIO_RISINGDETECT | 0x GPIO_FALLINGDETECT | 0x0400 As a result, the system will start to lose interrupts from the pcf8575 GPIO expander, because when the OMAP GPIO IP is in smart-idle wakeup mode, there is no guarantee that transition(s) on an input non-wake-up GPIO pin will trigger an asynchronous wake-up request to PRCM and then IRQ generation. The IRQ will be generated when the GPIO is in active mode - for example, some time after accessing GPIO bank registers IRQs will be generated normally, but the issue will happen again once PRCM puts the GPIO in low power smart-idle wakeup mode. Note 1. The issue is not reproduced if the debounce clk is enabled for the GPIO bank. Note 2. The issue is hardly reproducible if the GPIO pins group contains both wakeup/non-wakeup gpios - for example, it will be hard to reproduce the issue with pin2 if GPIO_IRQWAKEN_0=0x1 GPIO_IRQSTATUS_SET_0=0x3 GPIO_FALLINGDETECT = 0x3 (TRM "Power Saving by Grouping the Edge/Level Detection"). Note 3. There is nothing in common between system wake up and the OMAP GPIO bank IP wake up logic - the latter defines how the GPIO bank ON-IDLE-ON transition will happen inside the SoC under control of PRCM. 
Hence, fix the problem by removing the omap_set_gpio_wakeup() function completely, keeping the GPIO IRQ mask/unmask (IRQSTATUS_SET) and wake up enable (GPIO_IRQWAKEN) bits always in sync; and adding the IRQCHIP_MASK_ON_SUSPEND flag to the OMAP GPIO irqchip. That way non wakeup GPIO IRQs will be properly masked/unmasked by the IRQ PM core during the suspend/resume cycle. Cc: Roger Quadros Signed-off-by: Grygorii Strashko --- The GPIO IP has two levels of controls for wakeups and you are just removing the SYSCFG wakeup and relying on the IRQ line wakeup. I like the usage of "IRQCHIP_MASK_ON_SUSPEND" but please be careful with this change, which might break older OMAPs. drivers/gpio/gpio-omap.c | 42 ++ 1 file changed, 2 insertions(+), 40 deletions(-) diff --git a/drivers/gpio/gpio-omap.c b/drivers/gpio/gpio-omap.c index 551dfa9..b98ede7 100644 --- a/drivers/gpio/gpio-omap.c +++ b/drivers/gpio/gpio-omap.c @@ -611,51 +611,12 @@ static inline void omap_set_gpio_irqenable(struct gpio_bank *bank, omap_disable_gpio_irqbank(bank, BIT(offset)); } -/* - * Note that ENAWAKEUP needs to be enabled in GPIO_SYSCONFIG register. - * 1510 does not seem to have a wake-up register. If JTAG is connected - * to the target, system will wake up always on GPIO events. While - * system is running all registered GPIO interrupts need to have wake-up - * enabled. When system is suspended, only selected GPIO interrupts need - * to have wake-up enabled. 
- */ -static int omap_set_gpio_wakeup(struct gpio_bank *bank, unsigned offset, - int enable) -{ - u32 gpio_bit = BIT(offset); - unsigned long flags; - - if (bank->non_wakeup_gpios & gpio_bit) { - dev_err(bank->chip.parent, - "Unable to modify wakeup on non-wakeup GPIO%d\n", - offset); - return -EINVAL; - } - - raw_spin_lock_irqsave(&bank->lock, flags); - if (enable) - bank->context.wake_en |= gpio_bit; - else - bank->context.wake_en &= ~gpio_bit; - - writel_relaxed(bank->context.wake_en, bank->base + bank->regs->wkup_en); - raw_spin_unlock_irqrestore(&bank->lock, flags); - - return 0; -} - /* Use disable_irq_wake() and enable_irq_wake() functions from drivers */ static int omap_gpio_wake_enable(struct irq_data *d, unsigned int enable) { struct gpio_bank *bank = omap_irq_data_get_bank(d); - unsigned offset = d->hwirq; - int ret; - ret = omap_set_gpio_wakeup(bank, offset, enable); - if (!ret) - ret = irq_set_irq_wake(bank->irq, enable); - - return ret; + return irq_set_irq_wake(bank->irq, enable); } static int omap_gpio_request(struct gpio_chip *chip, unsigned offset) @@ -1187,6 +1148,7 @@ static int omap_gpio_probe(struct platform_device
[net-next][PATCH 0/2] RDS: couple of fixes for 4.6
Patches are also available at below git tree. git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds-fixes Qing Huang (1): RDS: fix endianness for dp_ack_seq Santosh Shilimkar (1): RDS: Fix the atomicity for congestion map update net/rds/cong.c | 4 ++-- net/rds/ib_cm.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) -- 1.9.1
[net-next][PATCH 2/2] RDS: Fix the atomicity for congestion map update
Two different threads with different rds sockets may be in rds_recv_rcvbuf_delta() via the receive path. If their ports both map to the same word in the congestion map, then using non-atomic ops to update it could cause the map to be incorrect. Let's use atomics to avoid such an issue. Full credit to Wengang for finding the issue, analysing it and also pointing out the offending code with a spin-lock-based fix. Reviewed-by: Leon Romanovsky Signed-off-by: Wengang Wang Signed-off-by: Santosh Shilimkar --- net/rds/cong.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/rds/cong.c b/net/rds/cong.c index e6144b8..6641bcf 100644 --- a/net/rds/cong.c +++ b/net/rds/cong.c @@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __set_bit_le(off, (void *)map->m_page_addrs[i]); + set_bit_le(off, (void *)map->m_page_addrs[i]); } void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) @@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; - __clear_bit_le(off, (void *)map->m_page_addrs[i]); + clear_bit_le(off, (void *)map->m_page_addrs[i]); } static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) -- 1.9.1
[net-next][PATCH 1/2] RDS: fix endianness for dp_ack_seq
From: Qing Huang dp->dp_ack_seq is used in big endian format. We need to do the big endianness conversion when we assign a value in host format to it. Signed-off-by: Qing Huang Signed-off-by: Santosh Shilimkar --- net/rds/ib_cm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 8764970..310cabc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -194,7 +194,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn, dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version); dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version); dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); - dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + dp->dp_ack_seq = cpu_to_be64(rds_ib_piggyb_ack(ic)); /* Advertise flow control */ if (ic->i_flowctl) { -- 1.9.1
Re: [PATCH v3 2/2] usb:dwc3: pass arch data to xhci-hcd child
On 4/3/2016 11:28 PM, Felipe Balbi wrote: santosh shilimkar writes: +Arnd, RMK, On 4/1/2016 4:57 AM, Felipe Balbi wrote: Hi, Grygorii Strashko writes: On 04/01/2016 01:20 PM, Felipe Balbi wrote: [...] commit 7ace8fc8219e4cbbfd5b4790390d9a01a2541cdf Author: Yoshihiro Shimoda Date: Mon Jul 13 18:10:05 2015 +0900 usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU The dma_map_single and dma_unmap_single should set "gadget->dev.parent" instead of "&gadget->dev" in the first argument because the parent has a udc controller's device pointer. Otherwise, iommu functions are not called in ARM environment. Signed-off-by: Yoshihiro Shimoda Signed-off-by: Felipe Balbi Above actually means that DMA configuration code can be dropped from usb_add_gadget_udc_release() completely. Right?: true, but now I'm not sure what's better: copy all necessary bits from parent or just pass the parent device to all DMA API. Anybody to shed a light here ? The expectation is drivers should pass the proper dev pointers and let the core DMA code deal with it, since it knows the per-device dma properties. okay, so how do you get proper DMA pointers with something like this: kdwc3_dma_mask = dma_get_mask(dev); dev->dma_mask = &kdwc3_dma_mask; This doesn't do anything. Drivers actually need to touch dma_mask(s) only if the core DMA code hasn't populated them. I see Grygorii pointed out a couple of things already. Regards, Santosh
Re: [PATCH] ARM: OMAP: wakeupgen: Add comment for unhandled FROZEN transitions
On 4/4/2016 5:55 AM, Anna-Maria Gleixner wrote: FROZEN hotplug notifiers are not handled and do not have to be. Insert a comment to remember that the lack of the FROZEN transitions is no accident. Cc: Tony Lindgren Cc: Santosh Shilimkar Cc: linux-o...@vger.kernel.org Signed-off-by: Anna-Maria Gleixner --- Acked-by: Santosh Shilimkar
Re: [PATCH v3 2/2] usb:dwc3: pass arch data to xhci-hcd child
+Arnd, RMK, On 4/1/2016 4:57 AM, Felipe Balbi wrote: Hi, Grygorii Strashko writes: On 04/01/2016 01:20 PM, Felipe Balbi wrote: [...] commit 7ace8fc8219e4cbbfd5b4790390d9a01a2541cdf Author: Yoshihiro Shimoda Date: Mon Jul 13 18:10:05 2015 +0900 usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU The dma_map_single and dma_unmap_single should set "gadget->dev.parent" instead of "&gadget->dev" in the first argument because the parent has a udc controller's device pointer. Otherwise, iommu functions are not called in ARM environment. Signed-off-by: Yoshihiro Shimoda Signed-off-by: Felipe Balbi Above actually means that DMA configuration code can be dropped from usb_add_gadget_udc_release() completely. Right?: true, but now I'm not sure what's better: copy all necessary bits from parent or just pass the parent device to all DMA API. Anybody to shed a light here ? The expectation is drivers should pass the proper dev pointers and let core DMA code deal with it since it knows the per-device DMA properties. RMK did a massive series of patches to fix many drivers which were not adhering to the DMA API. Regards, Santosh
Re: [PATCH] ARM: dts: k2*: Rename the k2* files to keystone-k2* files
On 3/16/2016 7:39 AM, Nishanth Menon wrote: As reported in [1], rename the k2* dts files to keystone-* files this will force consistency throughout. Script for the same (and hand modified for Makefile and MAINTAINERS files): for i in arch/arm/boot/dts/k2* do b=`basename $i`; git mv $i arch/arm/boot/dts/keystone-$b; sed -i -e "s/$b/keystone-$b/g" arch/arm/boot/dts/*[si] done NOTE: bootloaders that depend on older dtb names will need to be updated as well. [1] http://marc.info/?l=linux-arm-kernel&m=145637407804754&w=2 Reported-by: Olof Johansson Signed-off-by: Nishanth Menon --- Thanks Nishanth for taking care of this. I will add this to -next soon. Regards, Santosh
Re: [PATCH] gpio: omap: drop dev field from gpio_bank structure
On 3/4/2016 7:25 AM, Grygorii Strashko wrote: GPIO chip structure already has "parent" field which is used for the same purpose as "dev" field in gpio_bank structure - store pointer on GPIO device. Hence, drop duplicated "dev" field from gpio_bank structure. Signed-off-by: Grygorii Strashko --- Looks good. Acked-by: Santosh Shilimkar
Re: RDS: Major clean-up with couple of new features for 4.6
On 3/2/2016 11:13 AM, David Miller wrote: From: Santosh Shilimkar Date: Tue, 1 Mar 2016 15:20:41 -0800 v3: Re-generated the same series by omitting "-D" option from git format-patch command. Since first patch has file removals, git apply/am can't deal with it when formated with '-D' option. Yeah this works much better, series applied, thanks. Thanks Dave !! Regards, Santosh
[net-next][PATCH v3 05/13] RDS: IB: Re-organise ibmr code
No functional changes. This is in preparation for adding fastreg memory registration support. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.c | 37 +++--- net/rds/ib.h | 25 +--- net/rds/ib_fmr.c | 217 +++ net/rds/ib_mr.h | 109 net/rds/ib_rdma.c | 379 +++--- 6 files changed, 422 insertions(+), 347 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_mr.h diff --git a/net/rds/Makefile b/net/rds/Makefile index 19e5485..bcf5591 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o + ib_sysctl.o ib_rdma.o ib_fmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.c b/net/rds/ib.c index 9481d55..bb32cb9 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -42,15 +42,16 @@ #include "rds.h" #include "ib.h" +#include "ib_mr.h" -unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE; -unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE; +unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; +unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; -module_param(rds_ib_fmr_1m_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA"); -module_param(rds_ib_fmr_8k_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA"); +module_param(rds_ib_mr_1m_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); +module_param(rds_ib_mr_8k_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); @@
-140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; - rds_ibdev->max_1m_fmrs = device->attrs.max_mr ? + rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), - rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size; + rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size; - rds_ibdev->max_8k_fmrs = device->attrs.max_mr ? + rds_ibdev->max_8k_mrs = device->attrs.max_mr ? min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE), - rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size; + rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size; rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom; rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom; @@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device) goto put_dev; } - rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n", + rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n", device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge, -rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs, -rds_ibdev->max_8k_fmrs); +rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, +rds_ibdev->max_8k_mrs); INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); @@ -364,7 +365,7 @@ void rds_ib_exit(void) rds_ib_sysctl_exit(); rds_ib_recv_exit(); rds_trans_unregister(&rds_ib_transport); - rds_ib_fmr_exit(); + rds_ib_mr_exit(); } struct rds_transport rds_ib_transport = { @@ -400,13 +401,13 @@ int rds_ib_init(void) INIT_LIST_HEAD(&rds_ib_devices); - ret = rds_ib_fmr_init(); + ret = rds_ib_mr_init(); if (ret) goto out; ret = ib_register_client(&rds_ib_client); if (ret) - goto out_fmr_exit; + goto out_mr_exit; ret = rds_ib_sysctl_init(); if (ret) @@ -430,8 +431,8 @@ 
out_sysctl: rds_ib_sysctl_exit(); out_ibreg: rds_ib_unregister_client(); -out_fmr_exi
[net-next][PATCH v3 07/13] RDS: IB: move FMR code to its own file
No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 126 +- net/rds/ib_mr.h | 6 +++ net/rds/ib_rdma.c | 108 ++ 3 files changed, 134 insertions(+), 106 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index 74f2c21..4fe8f4f 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; struct rds_ib_fmr *fmr; - int err = 0, iter = 0; + int err = 0; if (npages <= RDS_MR_8K_MSG_SIZE) pool = rds_ibdev->mr_8k_pool; else pool = rds_ibdev->mr_1m_pool; - if (atomic_read(&pool->dirty_count) >= pool->max_items / 10) - queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10); - - /* Switch pools if one of the pool is reaching upper limit */ - if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - pool = rds_ibdev->mr_1m_pool; - else - pool = rds_ibdev->mr_8k_pool; - } - - while (1) { - ibmr = rds_ib_reuse_mr(pool); - if (ibmr) - return ibmr; - - /* No clean MRs - now we have the choice of either -* allocating a fresh MR up to the limit imposed by the -* driver, or flush any dirty unused MRs. -* We try to avoid stalling in the send path if possible, -* so we allocate as long as we're allowed to. -* -* We're fussy with enforcing the FMR limit, though. If the -* driver tells us we can't use more than N fmrs, we shouldn't -* start arguing with it -*/ - if (atomic_inc_return(&pool->item_count) <= pool->max_items) - break; - - atomic_dec(&pool->item_count); - - if (++iter > 2) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted); - return ERR_PTR(-EAGAIN); - } - - /* We do have some empty MRs. Flush them out. 
*/ - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait); - rds_ib_flush_mr_pool(pool, 0, &ibmr); - if (ibmr) - return ibmr; - } + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev)); @@ -218,3 +173,76 @@ out: return ret; } + +struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev, +struct scatterlist *sg, +unsigned long nents, +u32 *key) +{ + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; + int ret; + + ibmr = rds_ib_alloc_fmr(rds_ibdev, nents); + if (IS_ERR(ibmr)) + return ibmr; + + ibmr->device = rds_ibdev; + fmr = &ibmr->u.fmr; + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key = fmr->fmr->rkey; + else + rds_ib_free_mr(ibmr, 0); + + return ibmr; +} + +void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed, + unsigned long *unpinned, unsigned int goal) +{ + struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; + LIST_HEAD(fmr_list); + int ret = 0; + unsigned int freed = *nfreed; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, list, unmap_list) { + fmr = &ibmr->u.fmr; + *unpinned += ibmr->sg_len; +
[net-next][PATCH v3 12/13] RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR(FRMR) memory registration and invalidation makes use of work request and completion queues for its operation. Patch allocates extra queue space towards these operation(s). Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h| 4 net/rds/ib_cm.c | 16 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c5eddc2..eeb0d6c 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -14,6 +14,7 @@ #define RDS_IB_DEFAULT_RECV_WR 1024 #define RDS_IB_DEFAULT_SEND_WR 256 +#define RDS_IB_DEFAULT_FR_WR 512 #define RDS_IB_DEFAULT_RETRY_COUNT 2 @@ -122,6 +123,9 @@ struct rds_ib_connection { struct ib_wci_send_wc[RDS_IB_WC_MAX]; struct ib_wci_recv_wc[RDS_IB_WC_MAX]; + /* To control the number of wrs from fastreg */ + atomic_ti_fastreg_wrs; + /* interrupt handling */ struct tasklet_struct i_send_tasklet; struct tasklet_struct i_recv_tasklet; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 7f68abc..83f4673 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) struct ib_qp_init_attr attr; struct ib_cq_init_attr cq_attr = {}; struct rds_ib_device *rds_ibdev; - int ret; + int ret, fr_queue_space; /* * It's normal to see a null device if an incoming connection races @@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn) if (!rds_ibdev) return -EOPNOTSUPP; + /* The fr_queue_space is currently set to 512, to add extra space on +* completion queue and send queue. This extra space is used for FRMR +* registration and invalidation work requests +*/ + fr_queue_space = (rds_ibdev->use_fastreg ? 
RDS_IB_DEFAULT_FR_WR : 0); + /* add the conn now so that connection establishment has the dev */ rds_ib_add_conn(rds_ibdev, conn); @@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) /* Protection domain and memory range */ ic->i_pd = rds_ibdev->pd; - cq_attr.cqe = ic->i_send_ring.w_nr + 1; + cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1; ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send, rds_ib_cq_event_handler, conn, @@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.event_handler = rds_ib_qp_event_handler; attr.qp_context = conn; /* + 1 to allow for the single ack message */ - attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1; attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; attr.cap.max_send_sge = rds_ibdev->max_sge; attr.cap.max_recv_sge = RDS_IB_RECV_SGE; @@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.qp_type = IB_QPT_RC; attr.send_cq = ic->i_send_cq; attr.recv_cq = ic->i_recv_cq; + atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR); /* * XXX this can fail if max_*_wr is too large? Are we supposed @@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn) */ wait_event(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && - (atomic_read(&ic->i_signaled_sends) == 0)); + (atomic_read(&ic->i_signaled_sends) == 0) && + (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR)); tasklet_kill(&ic->i_send_tasklet); tasklet_kill(&ic->i_recv_tasklet); -- 1.9.1
[net-next][PATCH v3 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency
This helps to combine asynchronous fastreg MR completion handler with send completion handler. No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 1 - net/rds/ib_cm.c | 42 +++--- net/rds/ib_send.c | 6 ++ 3 files changed, 29 insertions(+), 20 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index b3fdebb..09cd8e3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -28,7 +28,6 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 #define RDS_IB_WC_MAX 32 -#define RDS_IB_SEND_OP BIT_ULL(63) extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index da5a7fb..7f68abc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context) tasklet_schedule(&ic->i_recv_tasklet); } -static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, - struct ib_wc *wcs, - struct rds_ib_ack_state *ack_state) +static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs) { - int nr; - int i; + int nr, i; struct ib_wc *wc; while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { @@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - if (wc->wr_id & RDS_IB_SEND_OP) - rds_ib_send_cqe_handler(ic, wc); - else - rds_ib_recv_cqe_handler(ic, wc, ack_state); + rds_ib_send_cqe_handler(ic, wc); } } } @@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; struct rds_connection *conn = ic->conn; - struct rds_ib_ack_state state; rds_ib_stats_inc(s_ib_tasklet_call); - memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); - poll_cq(ic, 
ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); if (rds_conn_up(conn) && (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) || @@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data) rds_send_xmit(ic->conn); } +static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs, +struct rds_ib_ack_state *ack_state) +{ + int nr, i; + struct ib_wc *wc; + + while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { + for (i = 0; i < nr; i++) { + wc = wcs + i; + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", +(unsigned long long)wc->wr_id, wc->status, +wc->byte_len, be32_to_cpu(wc->ex.imm_data)); + + rds_ib_recv_cqe_handler(ic, wc, ack_state); + } + } +} + static void rds_ib_tasklet_fn_recv(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; @@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data) rds_ib_stats_inc(s_ib_tasklet_call); memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); if (state.ack_next_valid) rds_ib_set_ack(ic, state.ack_next, state.ack_required); diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index eac30bf..f27d2c8 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_op = NULL; - send->s_wr.wr_id = i | RDS_IB_SEND_OP; + send->s_wr.wr_id = i; send->s_wr.sg_list = send->s_sge; send->s_wr.ex.imm_data = 0; @@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc) oldest = rds_ib_ring_oldest(&ic->i_send_r
[net-next][PATCH v3 01/13] RDS: Drop stale iWARP RDMA transport
RDS iWarp support code has become stale and non-testable. As indicated earlier, I am dropping the support for it. If new iWarp users show up in the future, we can adapt the RDS IB transport for the special RDMA READ sink case. iWarp needs an MR for the RDMA READ sink. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- Documentation/networking/rds.txt | 4 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 13 +- net/rds/rdma_transport.h | 5 - 14 files changed, 7 insertions(+), 4614 deletions(-) delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index e1a3d59..9d219d8 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP. RDS is not Infiniband-specific; it was designed to support different transports. The current implementation used to support RDS over TCP as well -as IB. Work is in progress to support RDS over iWARP, and using DCE to -guarantee no dropped packets on Ethernet, it may be possible to use RDS over -UDP in the future. +as IB. 
The high-level semantics of RDS from the application's point of view are diff --git a/net/rds/Kconfig b/net/rds/Kconfig index f2c670b..bffde4b 100644 --- a/net/rds/Kconfig +++ b/net/rds/Kconfig @@ -4,14 +4,13 @@ config RDS depends on INET ---help--- The RDS (Reliable Datagram Sockets) protocol provides reliable, - sequenced delivery of datagrams over Infiniband, iWARP, - or TCP. + sequenced delivery of datagrams over Infiniband or TCP. config RDS_RDMA - tristate "RDS over Infiniband and iWARP" + tristate "RDS over Infiniband" depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS ---help--- - Allow RDS to use Infiniband and iWARP as a transport. + Allow RDS to use Infiniband as a transport. This transport supports RDMA operations. config RDS_TCP diff --git a/net/rds/Makefile b/net/rds/Makefile index 56d3f60..19e5485 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o \ - iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \ - iw_sysctl.o iw_rdma.o + ib_sysctl.o ib_rdma.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/iw.c b/net/rds/iw.c deleted file mode 100644 index f4a9fff..000 --- a/net/rds/iw.c +++ /dev/null @@ -1,312 +0,0 @@ -/* - * Copyright (c) 2006 Oracle. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - *copyright notice, this list of conditions and the following - *disclaimer. - * - * - Redistributions in binary form must reproduce the above - *copyright notice, this list of conditions and the following - *disclaimer in the documentation and/or other materials - *provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRI
[net-next][PATCH v3 06/13] RDS: IB: create struct rds_ib_fmr
Keep FMR-related fields in their own struct. The fastreg MR structure will be added to the union. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 17 ++--- net/rds/ib_mr.h | 11 +-- net/rds/ib_rdma.c | 14 ++ 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index d4f200d..74f2c21 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) { struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; int err = 0, iter = 0; if (npages <= RDS_MR_8K_MSG_SIZE) @@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) goto out_no_cigar; } - ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + fmr = &ibmr->u.fmr; + fmr->fmr = ib_alloc_fmr(rds_ibdev->pd, (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC), &pool->fmr_attr); - if (IS_ERR(ibmr->fmr)) { - err = PTR_ERR(ibmr->fmr); - ibmr->fmr = NULL; + if (IS_ERR(fmr->fmr)) { + err = PTR_ERR(fmr->fmr); + fmr->fmr = NULL; pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err); goto out_no_cigar; } @@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) out_no_cigar: if (ibmr) { - if (ibmr->fmr) - ib_dealloc_fmr(ibmr->fmr); + if (fmr->fmr) + ib_dealloc_fmr(fmr->fmr); kfree(ibmr); } atomic_dec(&pool->item_count); @@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, struct scatterlist *sg, unsigned int nents) { struct ib_device *dev = rds_ibdev->dev; + struct rds_ib_fmr *fmr = &ibmr->u.fmr; struct scatterlist *scat = sg; u64 io_addr = 0; u64 *dma_pages; @@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, (dma_addr & PAGE_MASK) + j; } - ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr); + ret = 
ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr); if (ret) goto out; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index d88724f..309ad59 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -43,11 +43,15 @@ #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1)) #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2)) +struct rds_ib_fmr { + struct ib_fmr *fmr; + u64 *dma; +}; + /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { struct rds_ib_device*device; struct rds_ib_mr_pool *pool; - struct ib_fmr *fmr; struct llist_node llnode; @@ -57,8 +61,11 @@ struct rds_ib_mr { struct scatterlist *sg; unsigned intsg_len; - u64 *dma; int sg_dma_len; + + union { + struct rds_ib_fmr fmr; + } u; }; /* Our own little MR pool */ diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index c594519..9e608d9 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) { struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; struct llist_node *clean_nodes; struct llist_node *clean_tail; LIST_HEAD(unmap_list); @@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, goto out; /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ - list_for_each_entry(ibmr, &unmap_list, unmap_list) - list_add(&ibmr->fmr->list, &fmr_list); + list_for_each_entry(ibmr, &unmap_list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } ret = ib_unmap_fmr(&fmr_list); if (ret) @@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, /* Now we can destroy the DMA mapping and unpin any pages */ list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) { unpinned += ibmr->sg_len; +
[net-next][PATCH v3 10/13] RDS: IB: add mr reused stats
Add MR reuse statistics to RDS IB transport. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 2 ++ net/rds/ib_rdma.c | 7 ++- net/rds/ib_stats.c | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c88cb22..62fe7d5 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -259,6 +259,8 @@ struct rds_ib_statistics { uint64_ts_ib_rdma_mr_1m_pool_flush; uint64_ts_ib_rdma_mr_1m_pool_wait; uint64_ts_ib_rdma_mr_1m_pool_depleted; + uint64_ts_ib_rdma_mr_8k_reused; + uint64_ts_ib_rdma_mr_1m_reused; uint64_ts_ib_atomic_cswp; uint64_ts_ib_atomic_fadd; }; diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 0e84843..ec7ea32 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool) flag = this_cpu_ptr(&clean_list_grace); set_bit(CLEAN_LIST_BUSY_BIT, flag); ret = llist_del_first(&pool->clean_list); - if (ret) + if (ret) { ibmr = llist_entry(ret, struct rds_ib_mr, llnode); + if (pool->pool_type == RDS_IB_MR_8K_POOL) + rds_ib_stats_inc(s_ib_rdma_mr_8k_reused); + else + rds_ib_stats_inc(s_ib_rdma_mr_1m_reused); + } clear_bit(CLEAN_LIST_BUSY_BIT, flag); preempt_enable(); diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c index d77e044..7e78dca 100644 --- a/net/rds/ib_stats.c +++ b/net/rds/ib_stats.c @@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = { "ib_rdma_mr_1m_pool_flush", "ib_rdma_mr_1m_pool_wait", "ib_rdma_mr_1m_pool_depleted", + "ib_rdma_mr_8k_reused", + "ib_rdma_mr_1m_reused", "ib_atomic_cswp", "ib_atomic_fadd", }; -- 1.9.1
[net-next][PATCH v3 03/13] MAINTAINERS: update RDS entry
Acked-by: Chien Yen Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- MAINTAINERS | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 27393cf..08b084a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9067,10 +9067,14 @@ S: Maintained F: drivers/net/ethernet/rdc/r6040.c RDS - RELIABLE DATAGRAM SOCKETS -M: Chien Yen +M: Santosh Shilimkar +L: net...@vger.kernel.org +L: linux-r...@vger.kernel.org L: rds-de...@oss.oracle.com (moderated for non-subscribers) +W: https://oss.oracle.com/projects/rds/ S: Supported F: net/rds/ +F: Documentation/networking/rds.txt READ-COPY UPDATE (RCU) M: "Paul E. McKenney" -- 1.9.1
RDS: Major clean-up with couple of new features for 4.6
v3: Re-generated the same series by omitting the "-D" option from the git format-patch command. Since the first patch has file removals, git apply/am can't deal with it when formatted with the '-D' option. v2: Dropped the module parameter from [PATCH 11/13] as suggested by David Miller. The series is generated against net-next but also applies against Linus's tip cleanly. The entire patchset is available at the git tree below: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds_v2 The diff-stat looks a bit scary since almost ~4K lines of code are getting removed. Brief summary of the series: - Drop the stale iWARP support: the RDS iWarp support code has been stale and non-testable for some time. As discussed and agreed earlier on the list, I am dropping its support for good. If new iWarp users show up in the future, the plan is to adapt the existing IB RDMA transport for the special sink case. - RDS gets SO_TIMESTAMP support - The long-due RDS maintainer entry gets updated - Some RDS IB code refactoring towards the new FastReg memory registration (FRMR) - Lastly, the initial support for FRMR. RDS IB RDMA performance with FRMR is not yet as good as FMR, and I do have some patches in progress to address that, but they are not ready for 4.6 so I left them out of this series. I am also keeping an eye on the new CQ API adaptations, as other ULPs are doing, and will try to adapt RDS for the same, most likely in the 4.7+ timeframe. 
Santosh Shilimkar (12): RDS: Drop stale iWARP RDMA transport RDS: Add support for SO_TIMESTAMP for incoming messages MAINTAINERS: update RDS entry RDS: IB: Remove the RDS_IB_SEND_OP dependency RDS: IB: Re-organise ibmr code RDS: IB: create struct rds_ib_fmr RDS: IB: move FMR code to its own file RDS: IB: add connection info to ibmr RDS: IB: handle the RDMA CM time wait event RDS: IB: add mr reused stats RDS: IB: add Fastreg MR (FRMR) detection support RDS: IB: allocate extra space on queues for FRMR support Avinash Repaka (1): RDS: IB: Support Fastreg MR (FRMR) memory registration mode Documentation/networking/rds.txt | 4 +- MAINTAINERS | 6 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/af_rds.c | 26 ++ net/rds/ib.c | 47 +- net/rds/ib.h | 37 +- net/rds/ib_cm.c | 59 ++- net/rds/ib_fmr.c | 248 ++ net/rds/ib_frmr.c| 376 +++ net/rds/ib_mr.h | 148 ++ net/rds/ib_rdma.c| 495 ++-- net/rds/ib_send.c| 6 +- net/rds/ib_stats.c | 2 + net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 21 +- net/rds/rdma_transport.h | 5 - net/rds/rds.h| 1 + net/rds/recv.c | 20 +- 27 files changed, 1065 insertions(+), 5035 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_frmr.c create mode 100644 net/rds/ib_mr.h delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c -- 1.9.1
[net-next][PATCH v3 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode
From: Avinash Repaka Fastreg MR (FRMR) is another method with which one can register memory with an HCA. Some of the newer HCAs support only the fastreg MR mode, so we need to add support for it to have RDS functional on them. Signed-off-by: Santosh Shilimkar Signed-off-by: Avinash Repaka Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.h | 1 + net/rds/ib_cm.c | 7 +- net/rds/ib_frmr.c | 376 ++ net/rds/ib_mr.h | 24 net/rds/ib_rdma.c | 17 ++- 6 files changed, 422 insertions(+), 5 deletions(-) create mode 100644 net/rds/ib_frmr.c diff --git a/net/rds/Makefile b/net/rds/Makefile index bcf5591..0e72bec 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o ib_fmr.o + ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.h b/net/rds/ib.h index eeb0d6c..627fb79 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_destroy_nodev_conns(void); +void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc); /* ib_recv.c */ int rds_ib_recv_init(void); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 83f4673..8764970 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - rds_ib_send_cqe_handler(ic, wc); + if (wc->wr_id <= ic->i_send_ring.w_nr || + wc->wr_id == RDS_IB_ACK_WR_ID) + rds_ib_send_cqe_handler(ic, wc); + else + 
rds_ib_mr_cqe_handler(ic, wc); + } } } diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c new file mode 100644 index 000..93ff038 --- /dev/null +++ b/net/rds/ib_frmr.c @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2016 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include "ib_mr.h" + +static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev, + int npages) +{ + struct rds_ib_mr_pool *pool; + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_frmr *frmr; + int err = 0; + + if (npages <= RDS_MR_8K_MSG_SIZE) + pool = rds_ibdev->mr_8k_pool; + else + pool = rds_ibdev->mr_1m_pool; + + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; + + ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, + rdsibdev_to_node(rds_ibdev)); + if (!ibmr) { + err = -ENOMEM; + goto out_no_cigar; + } + + frmr = &ibmr->u.frmr; + frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG, +pool->fmr_att
[net-next][PATCH v3 08/13] RDS: IB: add connection info to ibmr
Preparatory patch for FRMR support. From the connection info we can retrieve
the cm_id, which contains the qp handle needed for work request posting. We
also need to drop the RDS connection on QP error states, which is where the
connection handle becomes useful.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_mr.h | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-	struct rds_ib_device	*device;
-	struct rds_ib_mr_pool	*pool;
+	struct rds_ib_device	*device;
+	struct rds_ib_mr_pool	*pool;
+	struct rds_ib_connection	*ic;
 
-	struct llist_node	llnode;
+	struct llist_node	llnode;
 
 	/* unmap_list is for freeing */
-	struct list_head	unmap_list;
-	unsigned int	remap_count;
+	struct list_head	unmap_list;
+	unsigned int	remap_count;
 
-	struct scatterlist	*sg;
-	unsigned int	sg_len;
-	int	sg_dma_len;
+	struct scatterlist	*sg;
+	unsigned int	sg_len;
+	int	sg_dma_len;
 
 	union {
 		struct rds_ib_fmr	fmr;
--
1.9.1
[net-next][PATCH v3 09/13] RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that it can
reconnect and resume. While testing fastreg, this error happened in a
couple of tests but was going unnoticed.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/rdma_transport.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 		rds_conn_drop(conn);
 		break;
 
+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		if (conn) {
+			pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n",
+				&conn->c_laddr, &conn->c_faddr);
+			rds_conn_drop(conn);
+		}
+		break;
+
 	default:
 		/* things like device disconnect? */
 		printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
--
1.9.1
[net-next][PATCH v3 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages
The SO_TIMESTAMP generates time stamp for each incoming RDS messages User app can enable it by using SO_TIMESTAMP setsocketopt() at SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the time stamp in struct timeval format. Reviewed-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/af_rds.c | 26 ++ net/rds/rds.h| 1 + net/rds/recv.c | 20 ++-- 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index b5476aeb..6beaeb1 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char __user *optval, return rs->rs_transport ? 0 : -ENOPROTOOPT; } +static int rds_enable_recvtstamp(struct sock *sk, char __user *optval, +int optlen) +{ + int val, valbool; + + if (optlen != sizeof(int)) + return -EFAULT; + + if (get_user(val, (int __user *)optval)) + return -EFAULT; + + valbool = val ? 1 : 0; + + if (valbool) + sock_set_flag(sk, SOCK_RCVTSTAMP); + else + sock_reset_flag(sk, SOCK_RCVTSTAMP); + + return 0; +} + static int rds_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { @@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, int optname, ret = rds_set_transport(rs, optval, optlen); release_sock(sock->sk); break; + case SO_TIMESTAMP: + lock_sock(sock->sk); + ret = rds_enable_recvtstamp(sock->sk, optval, optlen); + release_sock(sock->sk); + break; default: ret = -ENOPROTOOPT; } diff --git a/net/rds/rds.h b/net/rds/rds.h index 0e2797b..80256b0 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -222,6 +222,7 @@ struct rds_incoming { __be32 i_saddr; rds_rdma_cookie_t i_rdma_cookie; + struct timeval i_rx_tstamp; }; struct rds_mr { diff --git a/net/rds/recv.c b/net/rds/recv.c index a00462b..c0be1ec 100644 --- a/net/rds/recv.c +++ b/net/rds/recv.c @@ -35,6 +35,8 @@ #include #include #include +#include +#include #include "rds.h" @@ -46,6 
+48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, inc->i_conn = conn; inc->i_saddr = saddr; inc->i_rdma_cookie = 0; + inc->i_rx_tstamp.tv_sec = 0; + inc->i_rx_tstamp.tv_usec = 0; } EXPORT_SYMBOL_GPL(rds_inc_init); @@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, be32_to_cpu(inc->i_hdr.h_len), inc->i_hdr.h_dport); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + do_gettimeofday(&inc->i_rx_tstamp); rds_inc_addref(inc); list_add_tail(&inc->i_item, &rs->rs_recv_queue); __rds_wake_sk_sleep(sk); @@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) /* * Receive any control messages. */ -static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg, +struct rds_sock *rs) { int ret = 0; @@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) return ret; } + if ((inc->i_rx_tstamp.tv_sec != 0) && + sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) { + ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, + sizeof(struct timeval), + &inc->i_rx_tstamp); + if (ret) + return ret; + } + return 0; } @@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, msg->msg_flags |= MSG_TRUNC; } - if (rds_cmsg_recv(inc, msg)) { + if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } -- 1.9.1
[net-next][PATCH v3 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
Discover Fast Memory Registration (FRMR) support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCAs might support
just FRMR, just FMR, or both. In case both MR types are supported, FMR is
used by default, in line with what everyone else is doing. The default
will be changed to FRMR once RDS performance with FRMR is comparable with
FMR; that work is in progress.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.c    | 10 ++++++++++
 net/rds/ib.h    |  4 ++++
 net/rds/ib_mr.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..b5342fd 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
 	rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+	rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+				IB_DEVICE_MEM_MGT_EXTENSIONS);
+	rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+			      device->map_phys_fmr && device->unmap_fmr);
+	rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr);
+
 	rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr ?: 32;
 	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
 		min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device)
 		 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 		 rds_ibdev->max_8k_mrs);
 
+	pr_info("RDS/IB: %s: %s supported and preferred\n",
+		device->name,
+		rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
 	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
 	INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
 	struct list_head	conn_list;
 	struct ib_device	*dev;
 	struct ib_pd	*pd;
+	bool	has_fmr;
+	bool	has_fr;
+	bool	use_fastreg;
+
 	unsigned int	max_mrs;
 	struct rds_ib_mr_pool	*mr_1m_pool;
 	struct rds_ib_mr_pool	*mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 					     int npages);
--
1.9.1
Re: [net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6
On 3/1/2016 2:33 PM, David Miller wrote:
> When I try to apply this series, it (strangely) fails on the first
> patch with:
Strange indeed, since the patches and the tree are against net-next.

> Applying: RDS: Drop stale iWARP RDMA transport
> error: removal patch leaves file contents
This patch has file removals, and it looks like git am/apply won't work
when the patch is formatted with "-D". It's good for review, but I didn't
realize it would create a problem for apply; sorry about that. git merge
or pull seems to work, though, when tried from the branch directly.

> Please sort this out and respin, thanks.
OK. I will send the same series again, with the first patch generated
without the -D option. Thanks!

Regards,
Santosh
[net-next][PATCH v2 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages
The SO_TIMESTAMP generates time stamp for each incoming RDS messages User app can enable it by using SO_TIMESTAMP setsocketopt() at SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the time stamp in struct timeval format. Reviewed-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/af_rds.c | 26 ++ net/rds/rds.h| 1 + net/rds/recv.c | 20 ++-- 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index b5476aeb..6beaeb1 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char __user *optval, return rs->rs_transport ? 0 : -ENOPROTOOPT; } +static int rds_enable_recvtstamp(struct sock *sk, char __user *optval, +int optlen) +{ + int val, valbool; + + if (optlen != sizeof(int)) + return -EFAULT; + + if (get_user(val, (int __user *)optval)) + return -EFAULT; + + valbool = val ? 1 : 0; + + if (valbool) + sock_set_flag(sk, SOCK_RCVTSTAMP); + else + sock_reset_flag(sk, SOCK_RCVTSTAMP); + + return 0; +} + static int rds_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { @@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, int optname, ret = rds_set_transport(rs, optval, optlen); release_sock(sock->sk); break; + case SO_TIMESTAMP: + lock_sock(sock->sk); + ret = rds_enable_recvtstamp(sock->sk, optval, optlen); + release_sock(sock->sk); + break; default: ret = -ENOPROTOOPT; } diff --git a/net/rds/rds.h b/net/rds/rds.h index 0e2797b..80256b0 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -222,6 +222,7 @@ struct rds_incoming { __be32 i_saddr; rds_rdma_cookie_t i_rdma_cookie; + struct timeval i_rx_tstamp; }; struct rds_mr { diff --git a/net/rds/recv.c b/net/rds/recv.c index a00462b..c0be1ec 100644 --- a/net/rds/recv.c +++ b/net/rds/recv.c @@ -35,6 +35,8 @@ #include #include #include +#include +#include #include "rds.h" @@ -46,6 
+48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, inc->i_conn = conn; inc->i_saddr = saddr; inc->i_rdma_cookie = 0; + inc->i_rx_tstamp.tv_sec = 0; + inc->i_rx_tstamp.tv_usec = 0; } EXPORT_SYMBOL_GPL(rds_inc_init); @@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, be32_to_cpu(inc->i_hdr.h_len), inc->i_hdr.h_dport); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + do_gettimeofday(&inc->i_rx_tstamp); rds_inc_addref(inc); list_add_tail(&inc->i_item, &rs->rs_recv_queue); __rds_wake_sk_sleep(sk); @@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) /* * Receive any control messages. */ -static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg, +struct rds_sock *rs) { int ret = 0; @@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) return ret; } + if ((inc->i_rx_tstamp.tv_sec != 0) && + sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) { + ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, + sizeof(struct timeval), + &inc->i_rx_tstamp); + if (ret) + return ret; + } + return 0; } @@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, msg->msg_flags |= MSG_TRUNC; } - if (rds_cmsg_recv(inc, msg)) { + if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } -- 1.9.1
[net-next][PATCH v2 05/13] RDS: IB: Re-organise ibmr code
No functional changes. This is in preperation towards adding fastreg memory resgitration support. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.c | 37 +++--- net/rds/ib.h | 25 +--- net/rds/ib_fmr.c | 217 +++ net/rds/ib_mr.h | 109 net/rds/ib_rdma.c | 379 +++--- 6 files changed, 422 insertions(+), 347 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_mr.h diff --git a/net/rds/Makefile b/net/rds/Makefile index 19e5485..bcf5591 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o + ib_sysctl.o ib_rdma.o ib_fmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.c b/net/rds/ib.c index 9481d55..bb32cb9 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -42,15 +42,16 @@ #include "rds.h" #include "ib.h" +#include "ib_mr.h" -unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE; -unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE; +unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; +unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; -module_param(rds_ib_fmr_1m_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA"); -module_param(rds_ib_fmr_8k_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA"); +module_param(rds_ib_mr_1m_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); +module_param(rds_ib_mr_8k_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); @@ 
-140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; - rds_ibdev->max_1m_fmrs = device->attrs.max_mr ? + rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), - rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size; + rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size; - rds_ibdev->max_8k_fmrs = device->attrs.max_mr ? + rds_ibdev->max_8k_mrs = device->attrs.max_mr ? min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE), - rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size; + rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size; rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom; rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom; @@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device) goto put_dev; } - rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n", + rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n", device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge, -rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs, -rds_ibdev->max_8k_fmrs); +rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, +rds_ibdev->max_8k_mrs); INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); @@ -364,7 +365,7 @@ void rds_ib_exit(void) rds_ib_sysctl_exit(); rds_ib_recv_exit(); rds_trans_unregister(&rds_ib_transport); - rds_ib_fmr_exit(); + rds_ib_mr_exit(); } struct rds_transport rds_ib_transport = { @@ -400,13 +401,13 @@ int rds_ib_init(void) INIT_LIST_HEAD(&rds_ib_devices); - ret = rds_ib_fmr_init(); + ret = rds_ib_mr_init(); if (ret) goto out; ret = ib_register_client(&rds_ib_client); if (ret) - goto out_fmr_exit; + goto out_mr_exit; ret = rds_ib_sysctl_init(); if (ret) @@ -430,8 +431,8 @@ 
out_sysctl: rds_ib_sysctl_exit(); out_ibreg: rds_ib_unregister_client(); -out_fmr_exi
[net-next][PATCH v2 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency
This helps to combine asynchronous fastreg MR completion handler with send completion handler. No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 1 - net/rds/ib_cm.c | 42 +++--- net/rds/ib_send.c | 6 ++ 3 files changed, 29 insertions(+), 20 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index b3fdebb..09cd8e3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -28,7 +28,6 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 #define RDS_IB_WC_MAX 32 -#define RDS_IB_SEND_OP BIT_ULL(63) extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index da5a7fb..7f68abc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context) tasklet_schedule(&ic->i_recv_tasklet); } -static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, - struct ib_wc *wcs, - struct rds_ib_ack_state *ack_state) +static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs) { - int nr; - int i; + int nr, i; struct ib_wc *wc; while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { @@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - if (wc->wr_id & RDS_IB_SEND_OP) - rds_ib_send_cqe_handler(ic, wc); - else - rds_ib_recv_cqe_handler(ic, wc, ack_state); + rds_ib_send_cqe_handler(ic, wc); } } } @@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; struct rds_connection *conn = ic->conn; - struct rds_ib_ack_state state; rds_ib_stats_inc(s_ib_tasklet_call); - memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); - poll_cq(ic, 
ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); if (rds_conn_up(conn) && (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) || @@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data) rds_send_xmit(ic->conn); } +static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs, +struct rds_ib_ack_state *ack_state) +{ + int nr, i; + struct ib_wc *wc; + + while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { + for (i = 0; i < nr; i++) { + wc = wcs + i; + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", +(unsigned long long)wc->wr_id, wc->status, +wc->byte_len, be32_to_cpu(wc->ex.imm_data)); + + rds_ib_recv_cqe_handler(ic, wc, ack_state); + } + } +} + static void rds_ib_tasklet_fn_recv(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; @@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data) rds_ib_stats_inc(s_ib_tasklet_call); memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); if (state.ack_next_valid) rds_ib_set_ack(ic, state.ack_next, state.ack_required); diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index eac30bf..f27d2c8 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_op = NULL; - send->s_wr.wr_id = i | RDS_IB_SEND_OP; + send->s_wr.wr_id = i; send->s_wr.sg_list = send->s_sge; send->s_wr.ex.imm_data = 0; @@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc) oldest = rds_ib_ring_oldest(&ic->i_send_r
[net-next][PATCH v2 06/13] RDS: IB: create struct rds_ib_fmr
Keep fmr related filed in its own struct. Fastreg MR structure will be added to the union. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 17 ++--- net/rds/ib_mr.h | 11 +-- net/rds/ib_rdma.c | 14 ++ 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index d4f200d..74f2c21 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) { struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; int err = 0, iter = 0; if (npages <= RDS_MR_8K_MSG_SIZE) @@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) goto out_no_cigar; } - ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + fmr = &ibmr->u.fmr; + fmr->fmr = ib_alloc_fmr(rds_ibdev->pd, (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC), &pool->fmr_attr); - if (IS_ERR(ibmr->fmr)) { - err = PTR_ERR(ibmr->fmr); - ibmr->fmr = NULL; + if (IS_ERR(fmr->fmr)) { + err = PTR_ERR(fmr->fmr); + fmr->fmr = NULL; pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err); goto out_no_cigar; } @@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) out_no_cigar: if (ibmr) { - if (ibmr->fmr) - ib_dealloc_fmr(ibmr->fmr); + if (fmr->fmr) + ib_dealloc_fmr(fmr->fmr); kfree(ibmr); } atomic_dec(&pool->item_count); @@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, struct scatterlist *sg, unsigned int nents) { struct ib_device *dev = rds_ibdev->dev; + struct rds_ib_fmr *fmr = &ibmr->u.fmr; struct scatterlist *scat = sg; u64 io_addr = 0; u64 *dma_pages; @@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, (dma_addr & PAGE_MASK) + j; } - ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr); + ret = 
ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr); if (ret) goto out; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index d88724f..309ad59 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -43,11 +43,15 @@ #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1)) #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2)) +struct rds_ib_fmr { + struct ib_fmr *fmr; + u64 *dma; +}; + /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { struct rds_ib_device*device; struct rds_ib_mr_pool *pool; - struct ib_fmr *fmr; struct llist_node llnode; @@ -57,8 +61,11 @@ struct rds_ib_mr { struct scatterlist *sg; unsigned intsg_len; - u64 *dma; int sg_dma_len; + + union { + struct rds_ib_fmr fmr; + } u; }; /* Our own little MR pool */ diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index c594519..9e608d9 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) { struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; struct llist_node *clean_nodes; struct llist_node *clean_tail; LIST_HEAD(unmap_list); @@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, goto out; /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ - list_for_each_entry(ibmr, &unmap_list, unmap_list) - list_add(&ibmr->fmr->list, &fmr_list); + list_for_each_entry(ibmr, &unmap_list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } ret = ib_unmap_fmr(&fmr_list); if (ret) @@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, /* Now we can destroy the DMA mapping and unpin any pages */ list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) { unpinned += ibmr->sg_len; +
[net-next][PATCH v2 03/13] MAINTAINERS: update RDS entry
Acked-by: Chien Yen
Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 MAINTAINERS | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 27393cf..08b084a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9067,10 +9067,14 @@ S:	Maintained
 F:	drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M:	Chien Yen
+M:	Santosh Shilimkar
+L:	net...@vger.kernel.org
+L:	linux-r...@vger.kernel.org
 L:	rds-de...@oss.oracle.com (moderated for non-subscribers)
+W:	https://oss.oracle.com/projects/rds/
 S:	Supported
 F:	net/rds/
+F:	Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M:	"Paul E. McKenney"
--
1.9.1
[net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport
The RDS iWARP support code has become stale and untestable. As indicated
earlier, I am dropping support for it. If new iWARP user(s) show up in the
future, we can adapt the RDS IB transport for the special RDMA READ sink
case; iWARP needs an MR for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig                  |   7 +-
 net/rds/Makefile                 |   4 +-
 net/rds/iw.c                     | 312 -
 net/rds/iw.h                     | 398 -
 net/rds/iw_cm.c                  | 769 -
 net/rds/iw_rdma.c                | 837 -
 net/rds/iw_recv.c                | 904 -
 net/rds/iw_ring.c                | 169 -
 net/rds/iw_send.c                | 981 -
 net/rds/iw_stats.c               |  95 -
 net/rds/iw_sysctl.c              | 123 -
 net/rds/rdma_transport.c         |  13 +-
 net/rds/rdma_transport.h         |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB.  Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
 	depends on INET
 	---help---
 	  The RDS (Reliable Datagram Sockets) protocol provides reliable,
-	  sequenced delivery of datagrams over Infiniband, iWARP,
-	  or TCP.
+	  sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-	tristate "RDS over Infiniband and iWARP"
+	tristate "RDS over Infiniband"
 	depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
 	---help---
-	  Allow RDS to use Infiniband and iWARP as a transport.
+	  Allow RDS to use Infiniband as a transport.
 	  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=	rdma_transport.o \
 	ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-	ib_sysctl.o ib_rdma.o \
-	iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-	iw_sysctl.o iw_rdma.o
+	ib_sysctl.o ib_rdma.o
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
 
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 	rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 		 event->event, rdma_event_msg(event->event));
 
-	if (cm_id->device->node_type == RDMA_NODE_RNIC)
-		trans = &
[net-next][PATCH v2 10/13] RDS: IB: add mr reused stats
Add MR reuse statistics to the RDS IB transport.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.h       | 2 ++
 net/rds/ib_rdma.c  | 7 ++++++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
 	uint64_t	s_ib_rdma_mr_1m_pool_flush;
 	uint64_t	s_ib_rdma_mr_1m_pool_wait;
 	uint64_t	s_ib_rdma_mr_1m_pool_depleted;
+	uint64_t	s_ib_rdma_mr_8k_reused;
+	uint64_t	s_ib_rdma_mr_1m_reused;
 	uint64_t	s_ib_atomic_cswp;
 	uint64_t	s_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0e84843..ec7ea32 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
 	flag = this_cpu_ptr(&clean_list_grace);
 	set_bit(CLEAN_LIST_BUSY_BIT, flag);
 	ret = llist_del_first(&pool->clean_list);
-	if (ret)
+	if (ret) {
 		ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+		if (pool->pool_type == RDS_IB_MR_8K_POOL)
+			rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+		else
+			rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+	}
 
 	clear_bit(CLEAN_LIST_BUSY_BIT, flag);
 	preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
 	"ib_rdma_mr_1m_pool_flush",
 	"ib_rdma_mr_1m_pool_wait",
 	"ib_rdma_mr_1m_pool_depleted",
+	"ib_rdma_mr_8k_reused",
+	"ib_rdma_mr_1m_reused",
 	"ib_atomic_cswp",
 	"ib_atomic_fadd",
 };
--
1.9.1
[net-next][PATCH v2 07/13] RDS: IB: move FMR code to its own file
No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 126 +- net/rds/ib_mr.h | 6 +++ net/rds/ib_rdma.c | 108 ++ 3 files changed, 134 insertions(+), 106 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index 74f2c21..4fe8f4f 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; struct rds_ib_fmr *fmr; - int err = 0, iter = 0; + int err = 0; if (npages <= RDS_MR_8K_MSG_SIZE) pool = rds_ibdev->mr_8k_pool; else pool = rds_ibdev->mr_1m_pool; - if (atomic_read(&pool->dirty_count) >= pool->max_items / 10) - queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10); - - /* Switch pools if one of the pool is reaching upper limit */ - if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - pool = rds_ibdev->mr_1m_pool; - else - pool = rds_ibdev->mr_8k_pool; - } - - while (1) { - ibmr = rds_ib_reuse_mr(pool); - if (ibmr) - return ibmr; - - /* No clean MRs - now we have the choice of either -* allocating a fresh MR up to the limit imposed by the -* driver, or flush any dirty unused MRs. -* We try to avoid stalling in the send path if possible, -* so we allocate as long as we're allowed to. -* -* We're fussy with enforcing the FMR limit, though. If the -* driver tells us we can't use more than N fmrs, we shouldn't -* start arguing with it -*/ - if (atomic_inc_return(&pool->item_count) <= pool->max_items) - break; - - atomic_dec(&pool->item_count); - - if (++iter > 2) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted); - return ERR_PTR(-EAGAIN); - } - - /* We do have some empty MRs. Flush them out. 
*/ - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait); - rds_ib_flush_mr_pool(pool, 0, &ibmr); - if (ibmr) - return ibmr; - } + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev)); @@ -218,3 +173,76 @@ out: return ret; } + +struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev, +struct scatterlist *sg, +unsigned long nents, +u32 *key) +{ + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; + int ret; + + ibmr = rds_ib_alloc_fmr(rds_ibdev, nents); + if (IS_ERR(ibmr)) + return ibmr; + + ibmr->device = rds_ibdev; + fmr = &ibmr->u.fmr; + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key = fmr->fmr->rkey; + else + rds_ib_free_mr(ibmr, 0); + + return ibmr; +} + +void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed, + unsigned long *unpinned, unsigned int goal) +{ + struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; + LIST_HEAD(fmr_list); + int ret = 0; + unsigned int freed = *nfreed; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, list, unmap_list) { + fmr = &ibmr->u.fmr; + *unpinned += ibmr->sg_len; +
[net-next][PATCH v2 09/13] RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that it can reconnect and resume. While testing fastreg, this error happened in a couple of tests but was going unnoticed. Signed-off-by: Roger Quadros --- Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/rdma_transport.c | 8 1 file changed, 8 insertions(+) diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 4f4b3d8..7220beb 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, rds_conn_drop(conn); break; + case RDMA_CM_EVENT_TIMEWAIT_EXIT: + if (conn) { + pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n", + &conn->c_laddr, &conn->c_faddr); + rds_conn_drop(conn); + } + break; + default: /* things like device disconnect? */ printk(KERN_ERR "RDS: unknown event %u (%s)!\n", -- 1.9.1
[net-next][PATCH v2 08/13] RDS: IB: add connection info to ibmr
Preparatory patch for FRMR support. From the connection info we can retrieve the cm_id, which contains the qp handle needed for work request posting. We also need to drop the RDS connection on QP error states, where the connection handle becomes useful. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_mr.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index f5c1fcb..add7725 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -50,18 +50,19 @@ struct rds_ib_fmr { /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { - struct rds_ib_device*device; - struct rds_ib_mr_pool *pool; + struct rds_ib_device*device; + struct rds_ib_mr_pool *pool; + struct rds_ib_connection*ic; - struct llist_node llnode; + struct llist_node llnode; /* unmap_list is for freeing */ - struct list_headunmap_list; - unsigned intremap_count; + struct list_headunmap_list; + unsigned intremap_count; - struct scatterlist *sg; - unsigned intsg_len; - int sg_dma_len; + struct scatterlist *sg; + unsigned intsg_len; + int sg_dma_len; union { struct rds_ib_fmr fmr; -- 1.9.1
[net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6
v2: Dropped the module parameter from [PATCH 11/13] as suggested by David Miller.

The series is generated against net-next but also applies cleanly against Linus's tip. The entire patchset is available at the git tree below:

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds_v2

The diff-stat looks a bit scary since almost ~4K lines of code are getting removed. Brief summary of the series:

- Drop the stale iWARP support: the RDS iWARP support code has become stale and untestable for some time. As discussed and agreed earlier on the list, I am dropping its support for good. If new iWARP user(s) show up in the future, the plan is to adapt the existing IB RDMA with a special sink case.
- RDS gets SO_TIMESTAMP support.
- The long-due RDS maintainer entry gets updated.
- Some RDS IB code refactoring towards the new FastReg memory registration (FRMR).
- Lastly, the initial support for FRMR.

RDS IB RDMA performance with FRMR is not yet as good as FMR, and I do have some patches in progress to address that, but they are not ready for 4.6 so I left them out of this series. I am also keeping an eye on the new CQ API adaptations other ULPs are doing, and will try to adapt RDS for the same, most likely in the 4.7+ timeframe.
Santosh Shilimkar (12): RDS: Drop stale iWARP RDMA transport RDS: Add support for SO_TIMESTAMP for incoming messages MAINTAINERS: update RDS entry RDS: IB: Remove the RDS_IB_SEND_OP dependency RDS: IB: Re-organise ibmr code RDS: IB: create struct rds_ib_fmr RDS: IB: move FMR code to its own file RDS: IB: add connection info to ibmr RDS: IB: handle the RDMA CM time wait event RDS: IB: add mr reused stats RDS: IB: add Fastreg MR (FRMR) detection support RDS: IB: allocate extra space on queues for FRMR support Avinash Repaka (1): RDS: IB: Support Fastreg MR (FRMR) memory registration mode Documentation/networking/rds.txt | 4 +- MAINTAINERS | 6 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/af_rds.c | 26 ++ net/rds/ib.c | 47 +- net/rds/ib.h | 37 +- net/rds/ib_cm.c | 59 ++- net/rds/ib_fmr.c | 248 ++ net/rds/ib_frmr.c| 376 +++ net/rds/ib_mr.h | 148 ++ net/rds/ib_rdma.c| 495 ++-- net/rds/ib_send.c| 6 +- net/rds/ib_stats.c | 2 + net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 21 +- net/rds/rdma_transport.h | 5 - net/rds/rds.h| 1 + net/rds/recv.c | 20 +- 27 files changed, 1065 insertions(+), 5035 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_frmr.c create mode 100644 net/rds/ib_mr.h delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c -- 1.9.1
[net-next][PATCH v2 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode
From: Avinash Repaka Fastreg MR (FRMR) is another method with which one can register memory with an HCA. Some of the newer HCAs support only the fastreg MR mode, so we need to add support for it to keep RDS functional on them. Signed-off-by: Santosh Shilimkar Signed-off-by: Avinash Repaka Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.h | 1 + net/rds/ib_cm.c | 7 +- net/rds/ib_frmr.c | 376 ++ net/rds/ib_mr.h | 24 net/rds/ib_rdma.c | 17 ++- 6 files changed, 422 insertions(+), 5 deletions(-) create mode 100644 net/rds/ib_frmr.c diff --git a/net/rds/Makefile b/net/rds/Makefile index bcf5591..0e72bec 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o ib_fmr.o + ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.h b/net/rds/ib.h index eeb0d6c..627fb79 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_destroy_nodev_conns(void); +void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc); /* ib_recv.c */ int rds_ib_recv_init(void); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 83f4673..8764970 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - rds_ib_send_cqe_handler(ic, wc); + if (wc->wr_id <= ic->i_send_ring.w_nr || + wc->wr_id == RDS_IB_ACK_WR_ID) + rds_ib_send_cqe_handler(ic, wc); + else + 
rds_ib_mr_cqe_handler(ic, wc); + } } } diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c new file mode 100644 index 000..93ff038 --- /dev/null +++ b/net/rds/ib_frmr.c @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2016 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include "ib_mr.h" + +static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev, + int npages) +{ + struct rds_ib_mr_pool *pool; + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_frmr *frmr; + int err = 0; + + if (npages <= RDS_MR_8K_MSG_SIZE) + pool = rds_ibdev->mr_8k_pool; + else + pool = rds_ibdev->mr_1m_pool; + + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; + + ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, + rdsibdev_to_node(rds_ibdev)); + if (!ibmr) { + err = -ENOMEM; + goto out_no_cigar; + } + + frmr = &ibmr->u.frmr; + frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG, +pool->fmr_att
[net-next][PATCH v2 12/13] RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR(FRMR) memory registration and invalidation makes use of work request and completion queues for its operation. Patch allocates extra queue space towards these operation(s). Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h| 4 net/rds/ib_cm.c | 16 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c5eddc2..eeb0d6c 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -14,6 +14,7 @@ #define RDS_IB_DEFAULT_RECV_WR 1024 #define RDS_IB_DEFAULT_SEND_WR 256 +#define RDS_IB_DEFAULT_FR_WR 512 #define RDS_IB_DEFAULT_RETRY_COUNT 2 @@ -122,6 +123,9 @@ struct rds_ib_connection { struct ib_wci_send_wc[RDS_IB_WC_MAX]; struct ib_wci_recv_wc[RDS_IB_WC_MAX]; + /* To control the number of wrs from fastreg */ + atomic_ti_fastreg_wrs; + /* interrupt handling */ struct tasklet_struct i_send_tasklet; struct tasklet_struct i_recv_tasklet; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 7f68abc..83f4673 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) struct ib_qp_init_attr attr; struct ib_cq_init_attr cq_attr = {}; struct rds_ib_device *rds_ibdev; - int ret; + int ret, fr_queue_space; /* * It's normal to see a null device if an incoming connection races @@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn) if (!rds_ibdev) return -EOPNOTSUPP; + /* The fr_queue_space is currently set to 512, to add extra space on +* completion queue and send queue. This extra space is used for FRMR +* registration and invalidation work requests +*/ + fr_queue_space = (rds_ibdev->use_fastreg ? 
RDS_IB_DEFAULT_FR_WR : 0); + /* add the conn now so that connection establishment has the dev */ rds_ib_add_conn(rds_ibdev, conn); @@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) /* Protection domain and memory range */ ic->i_pd = rds_ibdev->pd; - cq_attr.cqe = ic->i_send_ring.w_nr + 1; + cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1; ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send, rds_ib_cq_event_handler, conn, @@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.event_handler = rds_ib_qp_event_handler; attr.qp_context = conn; /* + 1 to allow for the single ack message */ - attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1; attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; attr.cap.max_send_sge = rds_ibdev->max_sge; attr.cap.max_recv_sge = RDS_IB_RECV_SGE; @@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.qp_type = IB_QPT_RC; attr.send_cq = ic->i_send_cq; attr.recv_cq = ic->i_recv_cq; + atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR); /* * XXX this can fail if max_*_wr is too large? Are we supposed @@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn) */ wait_event(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && - (atomic_read(&ic->i_signaled_sends) == 0)); + (atomic_read(&ic->i_signaled_sends) == 0) && + (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR)); tasklet_kill(&ic->i_send_tasklet); tasklet_kill(&ic->i_recv_tasklet); -- 1.9.1
[net-next][PATCH v2 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
Discover Fast Memory Registration support using the IB device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support just FRMR, just FMR, or both FMR and FRMR. In case both MR types are supported, FMR is used by default. The default MR mode is still kept as FMR, against what everyone else is following; the default will be changed to FRMR once RDS performance with FRMR is comparable with FMR. Work towards that is in progress. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- v2: Dropped the module parameter as suggested by David Miller net/rds/ib.c| 10 ++ net/rds/ib.h| 4 net/rds/ib_mr.h | 1 + 3 files changed, 15 insertions(+) diff --git a/net/rds/ib.c b/net/rds/ib.c index bb32cb9..b5342fd 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_wrs = device->attrs.max_qp_wr; rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); + rds_ibdev->has_fr = (device->attrs.device_cap_flags & + IB_DEVICE_MEM_MGT_EXTENSIONS); + rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr && + device->map_phys_fmr && device->unmap_fmr); + rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr); + rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), @@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, rds_ibdev->max_8k_mrs); + pr_info("RDS/IB: %s: %s supported and preferred\n", + device->name, + rds_ibdev->use_fastreg ? 
"FRMR" : "FMR"); + INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); diff --git a/net/rds/ib.h b/net/rds/ib.h index 62fe7d5..c5eddc2 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -200,6 +200,10 @@ struct rds_ib_device { struct list_headconn_list; struct ib_device*dev; struct ib_pd*pd; + boolhas_fmr; + boolhas_fr; + booluse_fastreg; + unsigned intmax_mrs; struct rds_ib_mr_pool *mr_1m_pool; struct rds_ib_mr_pool *mr_8k_pool; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index add7725..2f9b9c3 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -93,6 +93,7 @@ struct rds_ib_mr_pool { extern struct workqueue_struct *rds_ib_mr_wq; extern unsigned int rds_ib_mr_1m_pool_size; extern unsigned int rds_ib_mr_8k_pool_size; +extern bool prefer_frmr; struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev, int npages); -- 1.9.1
[net-next][PATCH 03/13] MAINTAINERS: update RDS entry
Acked-by: Chien Yen Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- MAINTAINERS | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 355e1c8..9d79bea 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9081,10 +9081,14 @@ S: Maintained F: drivers/net/ethernet/rdc/r6040.c RDS - RELIABLE DATAGRAM SOCKETS -M: Chien Yen +M: Santosh Shilimkar +L: net...@vger.kernel.org +L: linux-r...@vger.kernel.org L: rds-de...@oss.oracle.com (moderated for non-subscribers) +W: https://oss.oracle.com/projects/rds/ S: Supported F: net/rds/ +F: Documentation/networking/rds.txt READ-COPY UPDATE (RCU) M: "Paul E. McKenney" -- 1.9.1
[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport
The RDS iWARP support code has become stale and untestable. As indicated earlier, I am dropping support for it. If new iWARP user(s) show up in the future, we can adapt the RDS IB transport for the special RDMA READ sink case; iWARP needs an MR for the RDMA READ sink. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- Documentation/networking/rds.txt | 4 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 13 +- net/rds/rdma_transport.h | 5 - 14 files changed, 7 insertions(+), 4614 deletions(-) delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index e1a3d59..9d219d8 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP. RDS is not Infiniband-specific; it was designed to support different transports. The current implementation used to support RDS over TCP as well -as IB. Work is in progress to support RDS over iWARP, and using DCE to -guarantee no dropped packets on Ethernet, it may be possible to use RDS over -UDP in the future. +as IB. 
The high-level semantics of RDS from the application's point of view are diff --git a/net/rds/Kconfig b/net/rds/Kconfig index f2c670b..bffde4b 100644 --- a/net/rds/Kconfig +++ b/net/rds/Kconfig @@ -4,14 +4,13 @@ config RDS depends on INET ---help--- The RDS (Reliable Datagram Sockets) protocol provides reliable, - sequenced delivery of datagrams over Infiniband, iWARP, - or TCP. + sequenced delivery of datagrams over Infiniband or TCP. config RDS_RDMA - tristate "RDS over Infiniband and iWARP" + tristate "RDS over Infiniband" depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS ---help--- - Allow RDS to use Infiniband and iWARP as a transport. + Allow RDS to use Infiniband as a transport. This transport supports RDMA operations. config RDS_TCP diff --git a/net/rds/Makefile b/net/rds/Makefile index 56d3f60..19e5485 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o \ - iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \ - iw_sysctl.o iw_rdma.o + ib_sysctl.o ib_rdma.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/iw.c b/net/rds/iw.c deleted file mode 100644 index f4a9fff..000 diff --git a/net/rds/iw.h b/net/rds/iw.h deleted file mode 100644 index 5af01d1..000 diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c deleted file mode 100644 index aea4c91..000 diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c deleted file mode 100644 index b09a40c..000 diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c deleted file mode 100644 index a66d179..000 diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c deleted file mode 100644 index da8e3b6..000 diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c deleted file mode 100644 index e20bd50..000 diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c deleted file mode 100644 index 5fe67f6..000 diff 
--git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c deleted file mode 100644 index 139239d..000 diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 9c1fed8..4f4b3d8 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id, event->event, rdma_event_msg(event->event)); - if (cm_id->device->node_type == RDMA_NODE_RNIC) - trans = &
[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages
SO_TIMESTAMP generates a time stamp for each incoming RDS message. A user app can enable it by using the SO_TIMESTAMP setsockopt() at SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the time stamp in struct timeval format. Reviewed-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/af_rds.c | 26 ++ net/rds/rds.h| 1 + net/rds/recv.c | 20 ++-- 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index b5476aeb..6beaeb1 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char __user *optval, return rs->rs_transport ? 0 : -ENOPROTOOPT; } +static int rds_enable_recvtstamp(struct sock *sk, char __user *optval, +int optlen) +{ + int val, valbool; + + if (optlen != sizeof(int)) + return -EFAULT; + + if (get_user(val, (int __user *)optval)) + return -EFAULT; + + valbool = val ? 1 : 0; + + if (valbool) + sock_set_flag(sk, SOCK_RCVTSTAMP); + else + sock_reset_flag(sk, SOCK_RCVTSTAMP); + + return 0; +} + static int rds_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { @@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, int optname, ret = rds_set_transport(rs, optval, optlen); release_sock(sock->sk); break; + case SO_TIMESTAMP: + lock_sock(sock->sk); + ret = rds_enable_recvtstamp(sock->sk, optval, optlen); + release_sock(sock->sk); + break; default: ret = -ENOPROTOOPT; } diff --git a/net/rds/rds.h b/net/rds/rds.h index 0e2797b..80256b0 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -222,6 +222,7 @@ struct rds_incoming { __be32 i_saddr; rds_rdma_cookie_t i_rdma_cookie; + struct timeval i_rx_tstamp; }; struct rds_mr { diff --git a/net/rds/recv.c b/net/rds/recv.c index a00462b..c0be1ec 100644 --- a/net/rds/recv.c +++ b/net/rds/recv.c @@ -35,6 +35,8 @@ #include #include #include +#include +#include #include "rds.h" @@ -46,6 
+48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, inc->i_conn = conn; inc->i_saddr = saddr; inc->i_rdma_cookie = 0; + inc->i_rx_tstamp.tv_sec = 0; + inc->i_rx_tstamp.tv_usec = 0; } EXPORT_SYMBOL_GPL(rds_inc_init); @@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, be32_to_cpu(inc->i_hdr.h_len), inc->i_hdr.h_dport); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + do_gettimeofday(&inc->i_rx_tstamp); rds_inc_addref(inc); list_add_tail(&inc->i_item, &rs->rs_recv_queue); __rds_wake_sk_sleep(sk); @@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) /* * Receive any control messages. */ -static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg, +struct rds_sock *rs) { int ret = 0; @@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) return ret; } + if ((inc->i_rx_tstamp.tv_sec != 0) && + sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) { + ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, + sizeof(struct timeval), + &inc->i_rx_tstamp); + if (ret) + return ret; + } + return 0; } @@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, msg->msg_flags |= MSG_TRUNC; } - if (rds_cmsg_recv(inc, msg)) { + if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } -- 1.9.1
[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code
No functional changes. This is in preperation towards adding fastreg memory resgitration support. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.c | 37 +++--- net/rds/ib.h | 25 +--- net/rds/ib_fmr.c | 217 +++ net/rds/ib_mr.h | 109 net/rds/ib_rdma.c | 379 +++--- 6 files changed, 422 insertions(+), 347 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_mr.h diff --git a/net/rds/Makefile b/net/rds/Makefile index 19e5485..bcf5591 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o + ib_sysctl.o ib_rdma.o ib_fmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.c b/net/rds/ib.c index 9481d55..bb32cb9 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -42,15 +42,16 @@ #include "rds.h" #include "ib.h" +#include "ib_mr.h" -unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE; -unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE; +unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; +unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; -module_param(rds_ib_fmr_1m_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA"); -module_param(rds_ib_fmr_8k_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA"); +module_param(rds_ib_mr_1m_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); +module_param(rds_ib_mr_8k_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); @@ 
-140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; - rds_ibdev->max_1m_fmrs = device->attrs.max_mr ? + rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), - rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size; + rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size; - rds_ibdev->max_8k_fmrs = device->attrs.max_mr ? + rds_ibdev->max_8k_mrs = device->attrs.max_mr ? min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE), - rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size; + rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size; rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom; rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom; @@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device) goto put_dev; } - rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n", + rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n", device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge, -rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs, -rds_ibdev->max_8k_fmrs); +rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, +rds_ibdev->max_8k_mrs); INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); @@ -364,7 +365,7 @@ void rds_ib_exit(void) rds_ib_sysctl_exit(); rds_ib_recv_exit(); rds_trans_unregister(&rds_ib_transport); - rds_ib_fmr_exit(); + rds_ib_mr_exit(); } struct rds_transport rds_ib_transport = { @@ -400,13 +401,13 @@ int rds_ib_init(void) INIT_LIST_HEAD(&rds_ib_devices); - ret = rds_ib_fmr_init(); + ret = rds_ib_mr_init(); if (ret) goto out; ret = ib_register_client(&rds_ib_client); if (ret) - goto out_fmr_exit; + goto out_mr_exit; ret = rds_ib_sysctl_init(); if (ret) @@ -430,8 +431,8 @@ 
out_sysctl: rds_ib_sysctl_exit(); out_ibreg: rds_ib_unregister_client(); -out_fmr_exi
[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
Discover Fast Memory Registration support using the IB device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support just FRMR, just FMR, or both FMR and FRMR. In case both MR types are supported, FMR is used by default. Using the module parameter 'prefer_frmr', the user can choose the preferred MR method for RDS. Of course, the module parameter has no effect if the HCA supports only FRMR or only FMR. The default MR mode is still kept as FMR, against what everyone else is following; the default will be changed to FRMR once RDS performance with FRMR is comparable with FMR. Work towards that is in progress. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.c| 14 ++ net/rds/ib.h| 4 net/rds/ib_mr.h | 1 + 3 files changed, 19 insertions(+) diff --git a/net/rds/ib.c b/net/rds/ib.c index bb32cb9..68c94b0 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -47,6 +47,7 @@ unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; +bool prefer_frmr; module_param(rds_ib_mr_1m_pool_size, int, 0444); MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); @@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444); MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); +module_param(prefer_frmr, bool, 0444); +MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR supported"); /* * we have a clumsy combination of RCU and a rwsem protecting this list @@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_wrs = device->attrs.max_qp_wr; rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); + rds_ibdev->has_fr = (device->attrs.device_cap_flags & + IB_DEVICE_MEM_MGT_EXTENSIONS); + rds_ibdev->has_fmr = (device->alloc_fmr && 
device->dealloc_fmr && + device->map_phys_fmr && device->unmap_fmr); + rds_ibdev->use_fastreg = (rds_ibdev->has_fr && +(!rds_ibdev->has_fmr || prefer_frmr)); + rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), @@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, rds_ibdev->max_8k_mrs); + pr_info("RDS/IB: %s: %s supported and preferred\n", + device->name, + rds_ibdev->use_fastreg ? "FRMR" : "FMR"); + INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); diff --git a/net/rds/ib.h b/net/rds/ib.h index 62fe7d5..c5eddc2 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -200,6 +200,10 @@ struct rds_ib_device { struct list_headconn_list; struct ib_device*dev; struct ib_pd*pd; + boolhas_fmr; + boolhas_fr; + booluse_fastreg; + unsigned intmax_mrs; struct rds_ib_mr_pool *mr_1m_pool; struct rds_ib_mr_pool *mr_8k_pool; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index add7725..2f9b9c3 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -93,6 +93,7 @@ struct rds_ib_mr_pool { extern struct workqueue_struct *rds_ib_mr_wq; extern unsigned int rds_ib_mr_1m_pool_size; extern unsigned int rds_ib_mr_8k_pool_size; +extern bool prefer_frmr; struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev, int npages); -- 1.9.1
[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that it can reconnect and resume. While testing fastreg, this error happened in a couple of tests but went unnoticed. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/rdma_transport.c | 8 1 file changed, 8 insertions(+) diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 4f4b3d8..7220beb 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, rds_conn_drop(conn); break; + case RDMA_CM_EVENT_TIMEWAIT_EXIT: + if (conn) { + pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n", + &conn->c_laddr, &conn->c_faddr); + rds_conn_drop(conn); + } + break; + default: /* things like device disconnect? */ printk(KERN_ERR "RDS: unknown event %u (%s)!\n", -- 1.9.1
[net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6
The series is generated against net-next but also applies cleanly against Linus's tip. The diff-stat looks a bit scary since almost ~4K lines of code are getting removed. Brief summary of the series: - Drop the stale iWARP support: The RDS iWARP support code has become stale and untestable for some time. As discussed and agreed earlier on the list [1], I am dropping its support for good. If new iWARP user(s) show up in the future, the plan is to adapt the existing IB RDMA with a special sink case. - RDS gets SO_TIMESTAMP support - The long-due RDS maintainer entry gets updated - Some RDS IB code refactoring towards the new FastReg Memory Registration (FRMR) - Lastly, the initial support for FRMR. RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have some patches in progress to address that, but they are not ready for 4.6 so I left them out of this series. Also, I am keeping an eye on the new CQ API adaptations, as other ULPs are doing, and will try to adapt RDS for the same, most likely in the 4.7 timeframe. The entire patchset is available in the git tree below: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds Feedback/comments welcome !! 
Santosh Shilimkar (12): RDS: Drop stale iWARP RDMA transport RDS: Add support for SO_TIMESTAMP for incoming messages MAINTAINERS: update RDS entry RDS: IB: Remove the RDS_IB_SEND_OP dependency RDS: IB: Re-organise ibmr code RDS: IB: create struct rds_ib_fmr RDS: IB: move FMR code to its own file RDS: IB: add connection info to ibmr RDS: IB: handle the RDMA CM time wait event RDS: IB: add mr reused stats RDS: IB: add Fastreg MR (FRMR) detection support RDS: IB: allocate extra space on queues for FRMR support Avinash Repaka (1): RDS: IB: Support Fastreg MR (FRMR) memory registration mode Documentation/networking/rds.txt | 4 +- MAINTAINERS | 6 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/af_rds.c | 26 ++ net/rds/ib.c | 51 +- net/rds/ib.h | 37 +- net/rds/ib_cm.c | 59 ++- net/rds/ib_fmr.c | 248 ++ net/rds/ib_frmr.c| 376 +++ net/rds/ib_mr.h | 148 ++ net/rds/ib_rdma.c| 492 ++-- net/rds/ib_send.c| 6 +- net/rds/ib_stats.c | 2 + net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 21 +- net/rds/rdma_transport.h | 5 - net/rds/rds.h| 1 + net/rds/recv.c | 20 +- 27 files changed, 1068 insertions(+), 5033 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_frmr.c create mode 100644 net/rds/ib_mr.h delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c Regards, Santosh [1] http://www.spinics.net/lists/linux-rdma/msg30769.html -- 1.9.1
[net-next][PATCH 10/13] RDS: IB: add mr reused stats
Add MR reuse statistics to RDS IB transport. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 2 ++ net/rds/ib_rdma.c | 7 ++- net/rds/ib_stats.c | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c88cb22..62fe7d5 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -259,6 +259,8 @@ struct rds_ib_statistics { uint64_ts_ib_rdma_mr_1m_pool_flush; uint64_ts_ib_rdma_mr_1m_pool_wait; uint64_ts_ib_rdma_mr_1m_pool_depleted; + uint64_ts_ib_rdma_mr_8k_reused; + uint64_ts_ib_rdma_mr_1m_reused; uint64_ts_ib_atomic_cswp; uint64_ts_ib_atomic_fadd; }; diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 20ff191..00e9064 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool) flag = this_cpu_ptr(&clean_list_grace); set_bit(CLEAN_LIST_BUSY_BIT, flag); ret = llist_del_first(&pool->clean_list); - if (ret) + if (ret) { ibmr = llist_entry(ret, struct rds_ib_mr, llnode); + if (pool->pool_type == RDS_IB_MR_8K_POOL) + rds_ib_stats_inc(s_ib_rdma_mr_8k_reused); + else + rds_ib_stats_inc(s_ib_rdma_mr_1m_reused); + } clear_bit(CLEAN_LIST_BUSY_BIT, flag); preempt_enable(); diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c index d77e044..7e78dca 100644 --- a/net/rds/ib_stats.c +++ b/net/rds/ib_stats.c @@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = { "ib_rdma_mr_1m_pool_flush", "ib_rdma_mr_1m_pool_wait", "ib_rdma_mr_1m_pool_depleted", + "ib_rdma_mr_8k_reused", + "ib_rdma_mr_1m_reused", "ib_atomic_cswp", "ib_atomic_fadd", }; -- 1.9.1
[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode
From: Avinash Repaka Fastreg MR (FRMR) is another method with which one can register memory with an HCA. Some of the newer HCAs support only the fastreg MR mode, so we need to add support for it in RDS to keep RDS functional on them. Some of the older HCAs support both FMR and FRMR modes, so to try out FRMR on older HCAs, one can use the module parameter 'prefer_frmr'. Signed-off-by: Santosh Shilimkar Signed-off-by: Avinash Repaka Signed-off-by: Santosh Shilimkar --- RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have some patches in progress to address that. But they are not ready for 4.6 so I left them out of this series. net/rds/Makefile | 2 +- net/rds/ib.h | 1 + net/rds/ib_cm.c | 7 +- net/rds/ib_frmr.c | 376 ++ net/rds/ib_mr.h | 24 net/rds/ib_rdma.c | 17 ++- 6 files changed, 422 insertions(+), 5 deletions(-) create mode 100644 net/rds/ib_frmr.c diff --git a/net/rds/Makefile b/net/rds/Makefile index bcf5591..0e72bec 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o ib_fmr.o + ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.h b/net/rds/ib.h index eeb0d6c..627fb79 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_destroy_nodev_conns(void); +void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc); /* ib_recv.c */ int rds_ib_recv_init(void); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 83f4673..8764970 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -249,7 +249,12 @@ static void 
poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - rds_ib_send_cqe_handler(ic, wc); + if (wc->wr_id <= ic->i_send_ring.w_nr || + wc->wr_id == RDS_IB_ACK_WR_ID) + rds_ib_send_cqe_handler(ic, wc); + else + rds_ib_mr_cqe_handler(ic, wc); + } } } diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c new file mode 100644 index 000..a86de13 --- /dev/null +++ b/net/rds/ib_frmr.c @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2016 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include "ib_mr.h" + +static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev, + int npages) +{ + struct rds_ib_mr_pool *pool; + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_frmr *frmr; + int err = 0; + + if (npages <= RDS_MR_8K_MSG_SIZE) + pool = rds_ibdev->mr_8k_pool; + else + pool = rds_ibdev->mr_1m_pool; + + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + retur
[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR (FRMR) memory registration and invalidation make use of work request and completion queues for their operation. This patch allocates extra queue space for these operations. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h| 4 net/rds/ib_cm.c | 16 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c5eddc2..eeb0d6c 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -14,6 +14,7 @@ #define RDS_IB_DEFAULT_RECV_WR 1024 #define RDS_IB_DEFAULT_SEND_WR 256 +#define RDS_IB_DEFAULT_FR_WR 512 #define RDS_IB_DEFAULT_RETRY_COUNT 2 @@ -122,6 +123,9 @@ struct rds_ib_connection { struct ib_wci_send_wc[RDS_IB_WC_MAX]; struct ib_wci_recv_wc[RDS_IB_WC_MAX]; + /* To control the number of wrs from fastreg */ + atomic_ti_fastreg_wrs; + /* interrupt handling */ struct tasklet_struct i_send_tasklet; struct tasklet_struct i_recv_tasklet; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 7f68abc..83f4673 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) struct ib_qp_init_attr attr; struct ib_cq_init_attr cq_attr = {}; struct rds_ib_device *rds_ibdev; - int ret; + int ret, fr_queue_space; /* * It's normal to see a null device if an incoming connection races @@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn) if (!rds_ibdev) return -EOPNOTSUPP; + /* The fr_queue_space is currently set to 512, to add extra space on +* completion queue and send queue. This extra space is used for FRMR +* registration and invalidation work requests +*/ + fr_queue_space = (rds_ibdev->use_fastreg ? 
RDS_IB_DEFAULT_FR_WR : 0); + /* add the conn now so that connection establishment has the dev */ rds_ib_add_conn(rds_ibdev, conn); @@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) /* Protection domain and memory range */ ic->i_pd = rds_ibdev->pd; - cq_attr.cqe = ic->i_send_ring.w_nr + 1; + cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1; ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send, rds_ib_cq_event_handler, conn, @@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.event_handler = rds_ib_qp_event_handler; attr.qp_context = conn; /* + 1 to allow for the single ack message */ - attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1; attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; attr.cap.max_send_sge = rds_ibdev->max_sge; attr.cap.max_recv_sge = RDS_IB_RECV_SGE; @@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.qp_type = IB_QPT_RC; attr.send_cq = ic->i_send_cq; attr.recv_cq = ic->i_recv_cq; + atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR); /* * XXX this can fail if max_*_wr is too large? Are we supposed @@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn) */ wait_event(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && - (atomic_read(&ic->i_signaled_sends) == 0)); + (atomic_read(&ic->i_signaled_sends) == 0) && + (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR)); tasklet_kill(&ic->i_send_tasklet); tasklet_kill(&ic->i_recv_tasklet); -- 1.9.1
[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr
Preparatory patch for FRMR support. From the connection info, we can retrieve the cm_id, which contains the QP handle needed for work request posting. We also need to drop the RDS connection on QP error states, where the connection handle becomes useful. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_mr.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index f5c1fcb..add7725 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -50,18 +50,19 @@ struct rds_ib_fmr { /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { - struct rds_ib_device*device; - struct rds_ib_mr_pool *pool; + struct rds_ib_device*device; + struct rds_ib_mr_pool *pool; + struct rds_ib_connection*ic; - struct llist_node llnode; + struct llist_node llnode; /* unmap_list is for freeing */ - struct list_headunmap_list; - unsigned intremap_count; + struct list_headunmap_list; + unsigned intremap_count; - struct scatterlist *sg; - unsigned intsg_len; - int sg_dma_len; + struct scatterlist *sg; + unsigned intsg_len; + int sg_dma_len; union { struct rds_ib_fmr fmr; -- 1.9.1
[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file
No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 126 +- net/rds/ib_mr.h | 6 +++ net/rds/ib_rdma.c | 105 ++--- 3 files changed, 133 insertions(+), 104 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index 74f2c21..4fe8f4f 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; struct rds_ib_fmr *fmr; - int err = 0, iter = 0; + int err = 0; if (npages <= RDS_MR_8K_MSG_SIZE) pool = rds_ibdev->mr_8k_pool; else pool = rds_ibdev->mr_1m_pool; - if (atomic_read(&pool->dirty_count) >= pool->max_items / 10) - queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10); - - /* Switch pools if one of the pool is reaching upper limit */ - if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - pool = rds_ibdev->mr_1m_pool; - else - pool = rds_ibdev->mr_8k_pool; - } - - while (1) { - ibmr = rds_ib_reuse_mr(pool); - if (ibmr) - return ibmr; - - /* No clean MRs - now we have the choice of either -* allocating a fresh MR up to the limit imposed by the -* driver, or flush any dirty unused MRs. -* We try to avoid stalling in the send path if possible, -* so we allocate as long as we're allowed to. -* -* We're fussy with enforcing the FMR limit, though. If the -* driver tells us we can't use more than N fmrs, we shouldn't -* start arguing with it -*/ - if (atomic_inc_return(&pool->item_count) <= pool->max_items) - break; - - atomic_dec(&pool->item_count); - - if (++iter > 2) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted); - return ERR_PTR(-EAGAIN); - } - - /* We do have some empty MRs. Flush them out. 
*/ - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait); - rds_ib_flush_mr_pool(pool, 0, &ibmr); - if (ibmr) - return ibmr; - } + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev)); @@ -218,3 +173,76 @@ out: return ret; } + +struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev, +struct scatterlist *sg, +unsigned long nents, +u32 *key) +{ + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; + int ret; + + ibmr = rds_ib_alloc_fmr(rds_ibdev, nents); + if (IS_ERR(ibmr)) + return ibmr; + + ibmr->device = rds_ibdev; + fmr = &ibmr->u.fmr; + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key = fmr->fmr->rkey; + else + rds_ib_free_mr(ibmr, 0); + + return ibmr; +} + +void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed, + unsigned long *unpinned, unsigned int goal) +{ + struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; + LIST_HEAD(fmr_list); + int ret = 0; + unsigned int freed = *nfreed; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, list, unmap_list) { + fmr = &ibmr->u.fmr; + *unpinned += ibmr->sg_len; +
[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr
Keep the FMR-related fields in their own struct. The fastreg MR structure will be added to the union later. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 17 ++--- net/rds/ib_mr.h | 11 +-- net/rds/ib_rdma.c | 14 ++ 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index d4f200d..74f2c21 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) { struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; int err = 0, iter = 0; if (npages <= RDS_MR_8K_MSG_SIZE) @@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) goto out_no_cigar; } - ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + fmr = &ibmr->u.fmr; + fmr->fmr = ib_alloc_fmr(rds_ibdev->pd, (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC), &pool->fmr_attr); - if (IS_ERR(ibmr->fmr)) { - err = PTR_ERR(ibmr->fmr); - ibmr->fmr = NULL; + if (IS_ERR(fmr->fmr)) { + err = PTR_ERR(fmr->fmr); + fmr->fmr = NULL; pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err); goto out_no_cigar; } @@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) out_no_cigar: if (ibmr) { - if (ibmr->fmr) - ib_dealloc_fmr(ibmr->fmr); + if (fmr->fmr) + ib_dealloc_fmr(fmr->fmr); kfree(ibmr); } atomic_dec(&pool->item_count); @@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, struct scatterlist *sg, unsigned int nents) { struct ib_device *dev = rds_ibdev->dev; + struct rds_ib_fmr *fmr = &ibmr->u.fmr; struct scatterlist *scat = sg; u64 io_addr = 0; u64 *dma_pages; @@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, (dma_addr & PAGE_MASK) + j; } - ret = 
ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr); if (ret) goto out; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index d88724f..309ad59 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -43,11 +43,15 @@ #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1)) #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2)) +struct rds_ib_fmr { + struct ib_fmr *fmr; + u64 *dma; +}; + /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { struct rds_ib_device*device; struct rds_ib_mr_pool *pool; - struct ib_fmr *fmr; struct llist_node llnode; @@ -57,8 +61,11 @@ struct rds_ib_mr { struct scatterlist *sg; unsigned intsg_len; - u64 *dma; int sg_dma_len; + + union { + struct rds_ib_fmr fmr; + } u; }; /* Our own little MR pool */ diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index c594519..9e608d9 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) { struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; struct llist_node *clean_nodes; struct llist_node *clean_tail; LIST_HEAD(unmap_list); @@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, goto out; /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ - list_for_each_entry(ibmr, &unmap_list, unmap_list) - list_add(&ibmr->fmr->list, &fmr_list); + list_for_each_entry(ibmr, &unmap_list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } ret = ib_unmap_fmr(&fmr_list); if (ret) @@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, /* Now we can destroy the DMA mapping and unpin any pages */ list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) { unpinned += ibmr->sg_len; +
[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency
This helps combine the asynchronous fastreg MR completion handler with the send completion handler. No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 1 - net/rds/ib_cm.c | 42 +++--- net/rds/ib_send.c | 6 ++ 3 files changed, 29 insertions(+), 20 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index b3fdebb..09cd8e3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -28,7 +28,6 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 #define RDS_IB_WC_MAX 32 -#define RDS_IB_SEND_OP BIT_ULL(63) extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index da5a7fb..7f68abc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context) tasklet_schedule(&ic->i_recv_tasklet); } -static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, - struct ib_wc *wcs, - struct rds_ib_ack_state *ack_state) +static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs) { - int nr; - int i; + int nr, i; struct ib_wc *wc; while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { @@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - if (wc->wr_id & RDS_IB_SEND_OP) - rds_ib_send_cqe_handler(ic, wc); - else - rds_ib_recv_cqe_handler(ic, wc, ack_state); + rds_ib_send_cqe_handler(ic, wc); } } } @@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; struct rds_connection *conn = ic->conn; - struct rds_ib_ack_state state; rds_ib_stats_inc(s_ib_tasklet_call); - memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); - poll_cq(ic, 
ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); if (rds_conn_up(conn) && (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) || @@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data) rds_send_xmit(ic->conn); } +static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs, +struct rds_ib_ack_state *ack_state) +{ + int nr, i; + struct ib_wc *wc; + + while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { + for (i = 0; i < nr; i++) { + wc = wcs + i; + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", +(unsigned long long)wc->wr_id, wc->status, +wc->byte_len, be32_to_cpu(wc->ex.imm_data)); + + rds_ib_recv_cqe_handler(ic, wc, ack_state); + } + } +} + static void rds_ib_tasklet_fn_recv(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; @@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data) rds_ib_stats_inc(s_ib_tasklet_call); memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); if (state.ack_next_valid) rds_ib_set_ack(ic, state.ack_next, state.ack_required); diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index eac30bf..f27d2c8 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_op = NULL; - send->s_wr.wr_id = i | RDS_IB_SEND_OP; + send->s_wr.wr_id = i; send->s_wr.sg_list = send->s_sge; send->s_wr.ex.imm_data = 0; @@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc) oldest = rds_ib_ring_oldest(&ic->i_send_r
Re: [net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
On 2/22/2016 7:38 AM, Bart Van Assche wrote: On 02/21/16 19:36, David Miller wrote: From: Santosh Shilimkar Date: Sat, 20 Feb 2016 03:30:02 -0800 @@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444); MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); +module_param(prefer_frmr, bool, 0444); +MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR supported"); Sorry, you're going to have to create a real run time method to configure this parameter. I'm strongly against module parameters. Please don't go into details about why this might be difficult to do, I'm totally not interested. Doing things properly is sometimes not easy, that's life. Sure Dave. Will drop the parameter. The runtime detection is already in place. When an HCA supports both the FMR and FRMR features, the parameter can be used as an override of the default selection. Hello Santosh, What is the purpose of the prefer_frmr kernel module parameter? Is this a parameter that is useful to RDS users, or is its only purpose to allow developers of the RDS module to test both the FMR and FRMR code paths on hardware that supports both MR methods? Right. Since FRMR support in RDS is still in an early phase, the parameter was useful on HCAs which support both registration methods. It's not a deal breaker, so I am going to drop the parameter as mentioned above. Regards, Santosh
[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport
The RDS iWARP support code has become stale and untestable. As indicated earlier, I am dropping the support for it. If new iWARP user(s) show up in the future, we can adapt the RDS IB transport for the special RDMA READ sink case. iWARP needs an MR for the RDMA READ sink. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- Documentation/networking/rds.txt | 4 +- net/rds/Kconfig | 7 +- net/rds/Makefile | 4 +- net/rds/iw.c | 312 - net/rds/iw.h | 398 net/rds/iw_cm.c | 769 -- net/rds/iw_rdma.c| 837 - net/rds/iw_recv.c| 904 net/rds/iw_ring.c| 169 --- net/rds/iw_send.c| 981 --- net/rds/iw_stats.c | 95 net/rds/iw_sysctl.c | 123 - net/rds/rdma_transport.c | 13 +- net/rds/rdma_transport.h | 5 - 14 files changed, 7 insertions(+), 4614 deletions(-) delete mode 100644 net/rds/iw.c delete mode 100644 net/rds/iw.h delete mode 100644 net/rds/iw_cm.c delete mode 100644 net/rds/iw_rdma.c delete mode 100644 net/rds/iw_recv.c delete mode 100644 net/rds/iw_ring.c delete mode 100644 net/rds/iw_send.c delete mode 100644 net/rds/iw_stats.c delete mode 100644 net/rds/iw_sysctl.c diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index e1a3d59..9d219d8 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP. RDS is not Infiniband-specific; it was designed to support different transports. The current implementation used to support RDS over TCP as well -as IB. Work is in progress to support RDS over iWARP, and using DCE to -guarantee no dropped packets on Ethernet, it may be possible to use RDS over -UDP in the future. +as IB. 
The high-level semantics of RDS from the application's point of view are diff --git a/net/rds/Kconfig b/net/rds/Kconfig index f2c670b..bffde4b 100644 --- a/net/rds/Kconfig +++ b/net/rds/Kconfig @@ -4,14 +4,13 @@ config RDS depends on INET ---help--- The RDS (Reliable Datagram Sockets) protocol provides reliable, - sequenced delivery of datagrams over Infiniband, iWARP, - or TCP. + sequenced delivery of datagrams over Infiniband or TCP. config RDS_RDMA - tristate "RDS over Infiniband and iWARP" + tristate "RDS over Infiniband" depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS ---help--- - Allow RDS to use Infiniband and iWARP as a transport. + Allow RDS to use Infiniband as a transport. This transport supports RDMA operations. config RDS_TCP diff --git a/net/rds/Makefile b/net/rds/Makefile index 56d3f60..19e5485 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o \ - iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \ - iw_sysctl.o iw_rdma.o + ib_sysctl.o ib_rdma.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/iw.c b/net/rds/iw.c deleted file mode 100644 index f4a9fff..000 diff --git a/net/rds/iw.h b/net/rds/iw.h deleted file mode 100644 index 5af01d1..000 diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c deleted file mode 100644 index aea4c91..000 diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c deleted file mode 100644 index b09a40c..000 diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c deleted file mode 100644 index a66d179..000 diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c deleted file mode 100644 index da8e3b6..000 diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c deleted file mode 100644 index e20bd50..000 diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c deleted file mode 100644 index 5fe67f6..000 diff 
--git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c deleted file mode 100644 index 139239d..000 diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 9c1fed8..4f4b3d8 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id, event->event, rdma_event_msg(event->event)); - if (cm_id->device->node_type == RDMA_NODE_RNIC) - trans = &
[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages
SO_TIMESTAMP generates a time stamp for each incoming RDS message. User apps can enable it by using the SO_TIMESTAMP setsockopt() at SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the time stamp in struct timeval format. Reviewed-by: Sowmini Varadhan Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/af_rds.c | 26 ++ net/rds/rds.h| 1 + net/rds/recv.c | 20 ++-- 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index b5476aeb..6beaeb1 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char __user *optval, return rs->rs_transport ? 0 : -ENOPROTOOPT; } +static int rds_enable_recvtstamp(struct sock *sk, char __user *optval, +int optlen) +{ + int val, valbool; + + if (optlen != sizeof(int)) + return -EFAULT; + + if (get_user(val, (int __user *)optval)) + return -EFAULT; + + valbool = val ? 1 : 0; + + if (valbool) + sock_set_flag(sk, SOCK_RCVTSTAMP); + else + sock_reset_flag(sk, SOCK_RCVTSTAMP); + + return 0; +} + static int rds_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen) { @@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, int optname, ret = rds_set_transport(rs, optval, optlen); release_sock(sock->sk); break; + case SO_TIMESTAMP: + lock_sock(sock->sk); + ret = rds_enable_recvtstamp(sock->sk, optval, optlen); + release_sock(sock->sk); + break; default: ret = -ENOPROTOOPT; } diff --git a/net/rds/rds.h b/net/rds/rds.h index 0e2797b..80256b0 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -222,6 +222,7 @@ struct rds_incoming { __be32 i_saddr; rds_rdma_cookie_t i_rdma_cookie; + struct timeval i_rx_tstamp; }; struct rds_mr { diff --git a/net/rds/recv.c b/net/rds/recv.c index a00462b..c0be1ec 100644 --- a/net/rds/recv.c +++ b/net/rds/recv.c @@ -35,6 +35,8 @@ #include #include #include +#include +#include #include "rds.h" @@ -46,6 
+48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, inc->i_conn = conn; inc->i_saddr = saddr; inc->i_rdma_cookie = 0; + inc->i_rx_tstamp.tv_sec = 0; + inc->i_rx_tstamp.tv_usec = 0; } EXPORT_SYMBOL_GPL(rds_inc_init); @@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, be32_to_cpu(inc->i_hdr.h_len), inc->i_hdr.h_dport); + if (sock_flag(sk, SOCK_RCVTSTAMP)) + do_gettimeofday(&inc->i_rx_tstamp); rds_inc_addref(inc); list_add_tail(&inc->i_item, &rs->rs_recv_queue); __rds_wake_sk_sleep(sk); @@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) /* * Receive any control messages. */ -static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg, +struct rds_sock *rs) { int ret = 0; @@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) return ret; } + if ((inc->i_rx_tstamp.tv_sec != 0) && + sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) { + ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, + sizeof(struct timeval), + &inc->i_rx_tstamp); + if (ret) + return ret; + } + return 0; } @@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, msg->msg_flags |= MSG_TRUNC; } - if (rds_cmsg_recv(inc, msg)) { + if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } -- 1.9.1
[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code
No functional changes. This is in preparation for adding fastreg memory registration support. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/Makefile | 2 +- net/rds/ib.c | 37 +++--- net/rds/ib.h | 25 +--- net/rds/ib_fmr.c | 217 +++ net/rds/ib_mr.h | 109 net/rds/ib_rdma.c | 379 +++--- 6 files changed, 422 insertions(+), 347 deletions(-) create mode 100644 net/rds/ib_fmr.c create mode 100644 net/rds/ib_mr.h diff --git a/net/rds/Makefile b/net/rds/Makefile index 19e5485..bcf5591 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o + ib_sysctl.o ib_rdma.o ib_fmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.c b/net/rds/ib.c index 9481d55..bb32cb9 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -42,15 +42,16 @@ #include "rds.h" #include "ib.h" +#include "ib_mr.h" -unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE; -unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE; +unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; +unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; -module_param(rds_ib_fmr_1m_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA"); -module_param(rds_ib_fmr_8k_pool_size, int, 0444); -MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA"); +module_param(rds_ib_mr_1m_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); +module_param(rds_ib_mr_8k_pool_size, int, 0444); +MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); @@ 
-140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; - rds_ibdev->max_1m_fmrs = device->attrs.max_mr ? + rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), - rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size; + rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size; - rds_ibdev->max_8k_fmrs = device->attrs.max_mr ? + rds_ibdev->max_8k_mrs = device->attrs.max_mr ? min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE), - rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size; + rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size; rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom; rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom; @@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device) goto put_dev; } - rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n", + rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n", device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge, -rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs, -rds_ibdev->max_8k_fmrs); +rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, +rds_ibdev->max_8k_mrs); INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); @@ -364,7 +365,7 @@ void rds_ib_exit(void) rds_ib_sysctl_exit(); rds_ib_recv_exit(); rds_trans_unregister(&rds_ib_transport); - rds_ib_fmr_exit(); + rds_ib_mr_exit(); } struct rds_transport rds_ib_transport = { @@ -400,13 +401,13 @@ int rds_ib_init(void) INIT_LIST_HEAD(&rds_ib_devices); - ret = rds_ib_fmr_init(); + ret = rds_ib_mr_init(); if (ret) goto out; ret = ib_register_client(&rds_ib_client); if (ret) - goto out_fmr_exit; + goto out_mr_exit; ret = rds_ib_sysctl_init(); if (ret) @@ -430,8 +431,8 @@ 
out_sysctl: rds_ib_sysctl_exit(); out_ibreg: rds_ib_unregister_client(); -out_fmr_exi
[net-next][PATCH 03/13] MAINTAINERS: update RDS entry
Acked-by: Chien Yen Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- MAINTAINERS | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 355e1c8..9d79bea 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9081,10 +9081,14 @@ S: Maintained F: drivers/net/ethernet/rdc/r6040.c RDS - RELIABLE DATAGRAM SOCKETS -M: Chien Yen +M: Santosh Shilimkar +L: net...@vger.kernel.org +L: linux-r...@vger.kernel.org L: rds-de...@oss.oracle.com (moderated for non-subscribers) +W: https://oss.oracle.com/projects/rds/ S: Supported F: net/rds/ +F: Documentation/networking/rds.txt READ-COPY UPDATE (RCU) M: "Paul E. McKenney" -- 1.9.1
[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR (FRMR) memory registration and invalidation make use of the work request and completion queues for their operation. This patch allocates extra queue space for these operations. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h| 4 net/rds/ib_cm.c | 16 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index c5eddc2..eeb0d6c 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -14,6 +14,7 @@ #define RDS_IB_DEFAULT_RECV_WR 1024 #define RDS_IB_DEFAULT_SEND_WR 256 +#define RDS_IB_DEFAULT_FR_WR 512 #define RDS_IB_DEFAULT_RETRY_COUNT 2 @@ -122,6 +123,9 @@ struct rds_ib_connection { struct ib_wci_send_wc[RDS_IB_WC_MAX]; struct ib_wci_recv_wc[RDS_IB_WC_MAX]; + /* To control the number of wrs from fastreg */ + atomic_ti_fastreg_wrs; + /* interrupt handling */ struct tasklet_struct i_send_tasklet; struct tasklet_struct i_recv_tasklet; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 7f68abc..83f4673 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) struct ib_qp_init_attr attr; struct ib_cq_init_attr cq_attr = {}; struct rds_ib_device *rds_ibdev; - int ret; + int ret, fr_queue_space; /* * It's normal to see a null device if an incoming connection races @@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn) if (!rds_ibdev) return -EOPNOTSUPP; + /* The fr_queue_space is currently set to 512, to add extra space on +* completion queue and send queue. This extra space is used for FRMR +* registration and invalidation work requests +*/ + fr_queue_space = (rds_ibdev->use_fastreg ? 
RDS_IB_DEFAULT_FR_WR : 0); + /* add the conn now so that connection establishment has the dev */ rds_ib_add_conn(rds_ibdev, conn); @@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) /* Protection domain and memory range */ ic->i_pd = rds_ibdev->pd; - cq_attr.cqe = ic->i_send_ring.w_nr + 1; + cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1; ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send, rds_ib_cq_event_handler, conn, @@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.event_handler = rds_ib_qp_event_handler; attr.qp_context = conn; /* + 1 to allow for the single ack message */ - attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1; attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; attr.cap.max_send_sge = rds_ibdev->max_sge; attr.cap.max_recv_sge = RDS_IB_RECV_SGE; @@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn) attr.qp_type = IB_QPT_RC; attr.send_cq = ic->i_send_cq; attr.recv_cq = ic->i_recv_cq; + atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR); /* * XXX this can fail if max_*_wr is too large? Are we supposed @@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn) */ wait_event(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && - (atomic_read(&ic->i_signaled_sends) == 0)); + (atomic_read(&ic->i_signaled_sends) == 0) && + (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR)); tasklet_kill(&ic->i_send_tasklet); tasklet_kill(&ic->i_recv_tasklet); -- 1.9.1
[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that it can reconnect and resume. While testing fastreg, this error happened in a couple of tests but went unnoticed. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/rdma_transport.c | 8 1 file changed, 8 insertions(+) diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c index 4f4b3d8..7220beb 100644 --- a/net/rds/rdma_transport.c +++ b/net/rds/rdma_transport.c @@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id, rds_conn_drop(conn); break; + case RDMA_CM_EVENT_TIMEWAIT_EXIT: + if (conn) { + pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n", + &conn->c_laddr, &conn->c_faddr); + rds_conn_drop(conn); + } + break; + default: /* things like device disconnect? */ printk(KERN_ERR "RDS: unknown event %u (%s)!\n", -- 1.9.1
[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency
This helps to combine asynchronous fastreg MR completion handler with send completion handler. No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.h | 1 - net/rds/ib_cm.c | 42 +++--- net/rds/ib_send.c | 6 ++ 3 files changed, 29 insertions(+), 20 deletions(-) diff --git a/net/rds/ib.h b/net/rds/ib.h index b3fdebb..09cd8e3 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -28,7 +28,6 @@ #define RDS_IB_RECYCLE_BATCH_COUNT 32 #define RDS_IB_WC_MAX 32 -#define RDS_IB_SEND_OP BIT_ULL(63) extern struct rw_semaphore rds_ib_devices_lock; extern struct list_head rds_ib_devices; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index da5a7fb..7f68abc 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context) tasklet_schedule(&ic->i_recv_tasklet); } -static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, - struct ib_wc *wcs, - struct rds_ib_ack_state *ack_state) +static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs) { - int nr; - int i; + int nr, i; struct ib_wc *wc; while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { @@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - if (wc->wr_id & RDS_IB_SEND_OP) - rds_ib_send_cqe_handler(ic, wc); - else - rds_ib_recv_cqe_handler(ic, wc, ack_state); + rds_ib_send_cqe_handler(ic, wc); } } } @@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; struct rds_connection *conn = ic->conn; - struct rds_ib_ack_state state; rds_ib_stats_inc(s_ib_tasklet_call); - memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); - poll_cq(ic, 
ic->i_send_cq, ic->i_send_wc, &state); + poll_scq(ic, ic->i_send_cq, ic->i_send_wc); if (rds_conn_up(conn) && (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) || @@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data) rds_send_xmit(ic->conn); } +static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq, +struct ib_wc *wcs, +struct rds_ib_ack_state *ack_state) +{ + int nr, i; + struct ib_wc *wc; + + while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) { + for (i = 0; i < nr; i++) { + wc = wcs + i; + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", +(unsigned long long)wc->wr_id, wc->status, +wc->byte_len, be32_to_cpu(wc->ex.imm_data)); + + rds_ib_recv_cqe_handler(ic, wc, ack_state); + } + } +} + static void rds_ib_tasklet_fn_recv(unsigned long data) { struct rds_ib_connection *ic = (struct rds_ib_connection *)data; @@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data) rds_ib_stats_inc(s_ib_tasklet_call); memset(&state, 0, sizeof(state)); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); - poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); + poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state); if (state.ack_next_valid) rds_ib_set_ack(ic, state.ack_next, state.ack_required); diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index eac30bf..f27d2c8 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_op = NULL; - send->s_wr.wr_id = i | RDS_IB_SEND_OP; + send->s_wr.wr_id = i; send->s_wr.sg_list = send->s_sge; send->s_wr.ex.imm_data = 0; @@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc) oldest = rds_ib_ring_oldest(&ic->i_send_r
[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file
No functional change. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 126 +- net/rds/ib_mr.h | 6 +++ net/rds/ib_rdma.c | 105 ++--- 3 files changed, 133 insertions(+), 104 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index 74f2c21..4fe8f4f 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; struct rds_ib_fmr *fmr; - int err = 0, iter = 0; + int err = 0; if (npages <= RDS_MR_8K_MSG_SIZE) pool = rds_ibdev->mr_8k_pool; else pool = rds_ibdev->mr_1m_pool; - if (atomic_read(&pool->dirty_count) >= pool->max_items / 10) - queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10); - - /* Switch pools if one of the pool is reaching upper limit */ - if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - pool = rds_ibdev->mr_1m_pool; - else - pool = rds_ibdev->mr_8k_pool; - } - - while (1) { - ibmr = rds_ib_reuse_mr(pool); - if (ibmr) - return ibmr; - - /* No clean MRs - now we have the choice of either -* allocating a fresh MR up to the limit imposed by the -* driver, or flush any dirty unused MRs. -* We try to avoid stalling in the send path if possible, -* so we allocate as long as we're allowed to. -* -* We're fussy with enforcing the FMR limit, though. If the -* driver tells us we can't use more than N fmrs, we shouldn't -* start arguing with it -*/ - if (atomic_inc_return(&pool->item_count) <= pool->max_items) - break; - - atomic_dec(&pool->item_count); - - if (++iter > 2) { - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted); - return ERR_PTR(-EAGAIN); - } - - /* We do have some empty MRs. Flush them out. 
*/ - if (pool->pool_type == RDS_IB_MR_8K_POOL) - rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait); - else - rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait); - rds_ib_flush_mr_pool(pool, 0, &ibmr); - if (ibmr) - return ibmr; - } + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + return ibmr; ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev)); @@ -218,3 +173,76 @@ out: return ret; } + +struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev, +struct scatterlist *sg, +unsigned long nents, +u32 *key) +{ + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; + int ret; + + ibmr = rds_ib_alloc_fmr(rds_ibdev, nents); + if (IS_ERR(ibmr)) + return ibmr; + + ibmr->device = rds_ibdev; + fmr = &ibmr->u.fmr; + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key = fmr->fmr->rkey; + else + rds_ib_free_mr(ibmr, 0); + + return ibmr; +} + +void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed, + unsigned long *unpinned, unsigned int goal) +{ + struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; + LIST_HEAD(fmr_list); + int ret = 0; + unsigned int freed = *nfreed; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, list, unmap_list) { + fmr = &ibmr->u.fmr; + *unpinned += ibmr->sg_len; +
[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
Discover Fast Memory Registration support using the IB device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCAs might support just FRMR or FMR, or both FMR and FRMR. In case both MR types are supported, FMR is used by default. Using the module parameter 'prefer_frmr', the user can choose the preferred MR method for RDS. Of course the module parameter has no effect if the HCA supports only FMR or only FRMR. The default MR is still kept as FMR, against what everyone else is following. The default will be changed to FRMR once RDS performance with FRMR is comparable with FMR; that work is in progress. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib.c| 14 ++ net/rds/ib.h| 4 net/rds/ib_mr.h | 1 + 3 files changed, 19 insertions(+) diff --git a/net/rds/ib.c b/net/rds/ib.c index bb32cb9..68c94b0 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -47,6 +47,7 @@ unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE; unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE; unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT; +bool prefer_frmr; module_param(rds_ib_mr_1m_pool_size, int, 0444); MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA"); @@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444); MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA"); module_param(rds_ib_retry_count, int, 0444); MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error"); +module_param(prefer_frmr, bool, 0444); +MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR supported"); /* * we have a clumsy combination of RCU and a rwsem protecting this list @@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->max_wrs = device->attrs.max_qp_wr; rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE); + rds_ibdev->has_fr = (device->attrs.device_cap_flags & + IB_DEVICE_MEM_MGT_EXTENSIONS); + rds_ibdev->has_fmr = (device->alloc_fmr && 
device->dealloc_fmr && + device->map_phys_fmr && device->unmap_fmr); + rds_ibdev->use_fastreg = (rds_ibdev->has_fr && +(!rds_ibdev->has_fmr || prefer_frmr)); + rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32; rds_ibdev->max_1m_mrs = device->attrs.max_mr ? min_t(unsigned int, (device->attrs.max_mr / 2), @@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device) rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs, rds_ibdev->max_8k_mrs); + pr_info("RDS/IB: %s: %s supported and preferred\n", + device->name, + rds_ibdev->use_fastreg ? "FRMR" : "FMR"); + INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); diff --git a/net/rds/ib.h b/net/rds/ib.h index 62fe7d5..c5eddc2 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -200,6 +200,10 @@ struct rds_ib_device { struct list_headconn_list; struct ib_device*dev; struct ib_pd*pd; + boolhas_fmr; + boolhas_fr; + booluse_fastreg; + unsigned intmax_mrs; struct rds_ib_mr_pool *mr_1m_pool; struct rds_ib_mr_pool *mr_8k_pool; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index add7725..2f9b9c3 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -93,6 +93,7 @@ struct rds_ib_mr_pool { extern struct workqueue_struct *rds_ib_mr_wq; extern unsigned int rds_ib_mr_1m_pool_size; extern unsigned int rds_ib_mr_8k_pool_size; +extern bool prefer_frmr; struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev, int npages); -- 1.9.1
[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode
From: Avinash Repaka Fastreg MR (FRMR) is another method with which one can register memory to the HCA. Some of the newer HCAs support only the fastreg MR mode, so we need to add support for it to RDS to have RDS functional on them. Some of the older HCAs support both FMR and FRMR modes. So to try out FRMR on older HCAs, one can use the module parameter 'prefer_frmr'. Signed-off-by: Santosh Shilimkar Signed-off-by: Avinash Repaka Signed-off-by: Santosh Shilimkar --- RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have some patches in progress to address that. But they are not ready for 4.6 so I left them out of this series. net/rds/Makefile | 2 +- net/rds/ib.h | 1 + net/rds/ib_cm.c | 7 +- net/rds/ib_frmr.c | 376 ++ net/rds/ib_mr.h | 24 net/rds/ib_rdma.c | 17 ++- 6 files changed, 422 insertions(+), 5 deletions(-) create mode 100644 net/rds/ib_frmr.c diff --git a/net/rds/Makefile b/net/rds/Makefile index bcf5591..0e72bec 100644 --- a/net/rds/Makefile +++ b/net/rds/Makefile @@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o message.o \ obj-$(CONFIG_RDS_RDMA) += rds_rdma.o rds_rdma-y := rdma_transport.o \ ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ - ib_sysctl.o ib_rdma.o ib_fmr.o + ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o obj-$(CONFIG_RDS_TCP) += rds_tcp.o diff --git a/net/rds/ib.h b/net/rds/ib.h index eeb0d6c..627fb79 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_destroy_nodev_conns(void); +void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc); /* ib_recv.c */ int rds_ib_recv_init(void); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 83f4673..8764970 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -249,7 +249,12 @@ static void 
poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq, (unsigned long long)wc->wr_id, wc->status, wc->byte_len, be32_to_cpu(wc->ex.imm_data)); - rds_ib_send_cqe_handler(ic, wc); + if (wc->wr_id <= ic->i_send_ring.w_nr || + wc->wr_id == RDS_IB_ACK_WR_ID) + rds_ib_send_cqe_handler(ic, wc); + else + rds_ib_mr_cqe_handler(ic, wc); + } } } diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c new file mode 100644 index 000..a86de13 --- /dev/null +++ b/net/rds/ib_frmr.c @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2016 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include "ib_mr.h" + +static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev, + int npages) +{ + struct rds_ib_mr_pool *pool; + struct rds_ib_mr *ibmr = NULL; + struct rds_ib_frmr *frmr; + int err = 0; + + if (npages <= RDS_MR_8K_MSG_SIZE) + pool = rds_ibdev->mr_8k_pool; + else + pool = rds_ibdev->mr_1m_pool; + + ibmr = rds_ib_try_reuse_ibmr(pool); + if (ibmr) + retur
[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr
Keep FMR-related fields in their own struct. The fastreg MR structure will be added to the union. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_fmr.c | 17 ++--- net/rds/ib_mr.h | 11 +-- net/rds/ib_rdma.c | 14 ++ 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c index d4f200d..74f2c21 100644 --- a/net/rds/ib_fmr.c +++ b/net/rds/ib_fmr.c @@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) { struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; + struct rds_ib_fmr *fmr; int err = 0, iter = 0; if (npages <= RDS_MR_8K_MSG_SIZE) @@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) goto out_no_cigar; } - ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + fmr = &ibmr->u.fmr; + fmr->fmr = ib_alloc_fmr(rds_ibdev->pd, (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC), &pool->fmr_attr); - if (IS_ERR(ibmr->fmr)) { - err = PTR_ERR(ibmr->fmr); - ibmr->fmr = NULL; + if (IS_ERR(fmr->fmr)) { + err = PTR_ERR(fmr->fmr); + fmr->fmr = NULL; pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err); goto out_no_cigar; } @@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages) out_no_cigar: if (ibmr) { - if (ibmr->fmr) - ib_dealloc_fmr(ibmr->fmr); + if (fmr->fmr) + ib_dealloc_fmr(fmr->fmr); kfree(ibmr); } atomic_dec(&pool->item_count); @@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, struct scatterlist *sg, unsigned int nents) { struct ib_device *dev = rds_ibdev->dev; + struct rds_ib_fmr *fmr = &ibmr->u.fmr; struct scatterlist *scat = sg; u64 io_addr = 0; u64 *dma_pages; @@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, (dma_addr & PAGE_MASK) + j; } - ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr); + ret = 
ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr); if (ret) goto out; diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index d88724f..309ad59 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -43,11 +43,15 @@ #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1)) #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2)) +struct rds_ib_fmr { + struct ib_fmr *fmr; + u64 *dma; +}; + /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { struct rds_ib_device*device; struct rds_ib_mr_pool *pool; - struct ib_fmr *fmr; struct llist_node llnode; @@ -57,8 +61,11 @@ struct rds_ib_mr { struct scatterlist *sg; unsigned intsg_len; - u64 *dma; int sg_dma_len; + + union { + struct rds_ib_fmr fmr; + } u; }; /* Our own little MR pool */ diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index c594519..9e608d9 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) { struct rds_ib_mr *ibmr, *next; + struct rds_ib_fmr *fmr; struct llist_node *clean_nodes; struct llist_node *clean_tail; LIST_HEAD(unmap_list); @@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, goto out; /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ - list_for_each_entry(ibmr, &unmap_list, unmap_list) - list_add(&ibmr->fmr->list, &fmr_list); + list_for_each_entry(ibmr, &unmap_list, unmap_list) { + fmr = &ibmr->u.fmr; + list_add(&fmr->fmr->list, &fmr_list); + } ret = ib_unmap_fmr(&fmr_list); if (ret) @@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, /* Now we can destroy the DMA mapping and unpin any pages */ list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) { unpinned += ibmr->sg_len; +
[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr
Preparatory patch for FRMR support. From the connection info, we can retrieve the cm_id, which contains the qp handle needed for work request posting. We also need to drop the RDS connection on QP error states, where the connection handle becomes useful. Signed-off-by: Santosh Shilimkar Signed-off-by: Santosh Shilimkar --- net/rds/ib_mr.h | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h index f5c1fcb..add7725 100644 --- a/net/rds/ib_mr.h +++ b/net/rds/ib_mr.h @@ -50,18 +50,19 @@ struct rds_ib_fmr { /* This is stored as mr->r_trans_private. */ struct rds_ib_mr { - struct rds_ib_device*device; - struct rds_ib_mr_pool *pool; + struct rds_ib_device*device; + struct rds_ib_mr_pool *pool; + struct rds_ib_connection*ic; - struct llist_node llnode; + struct llist_node llnode; /* unmap_list is for freeing */ - struct list_headunmap_list; - unsigned intremap_count; + struct list_headunmap_list; + unsigned intremap_count; - struct scatterlist *sg; - unsigned intsg_len; - int sg_dma_len; + struct scatterlist *sg; + unsigned intsg_len; + int sg_dma_len; union { struct rds_ib_fmr fmr; -- 1.9.1