Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper

2014-06-23 Thread Mike Christie
On 06/11/2014 04:09 AM, Sagi Grimberg wrote:
> In case protection information exists on the wire
> scsi transports should include it in the transfer
> byte count (even if protection information does not
> exist in the host memory space). This helper will
> compute the total transfer length from the scsi
> command data length and protection attributes.
> 
> Signed-off-by: Sagi Grimberg 
> Signed-off-by: Martin K. Petersen 
> ---
>  include/scsi/scsi_cmnd.h |   17 +
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
> index dd7c998..a100c6e 100644
> --- a/include/scsi/scsi_cmnd.h
> +++ b/include/scsi/scsi_cmnd.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct Scsi_Host;
>  struct scsi_device;
> @@ -306,4 +307,20 @@ static inline void set_driver_byte(struct scsi_cmnd 
> *cmd, char status)
>   cmd->result = (cmd->result & 0x00ff) | (status << 24);
>  }
>  
> +static inline unsigned scsi_transfer_length(struct scsi_cmnd *scmd)
> +{
> + unsigned int xfer_len = blk_rq_bytes(scmd->request);
> + unsigned int prot_op = scsi_get_prot_op(scmd);
> + unsigned int sector_size = scmd->device->sector_size;
> +
> + switch (prot_op) {
> + case SCSI_PROT_NORMAL:
> + case SCSI_PROT_WRITE_STRIP:
> + case SCSI_PROT_READ_INSERT:
> + return xfer_len;
> + }
> +
> + return xfer_len + (xfer_len >> ilog2(sector_size)) * 8;
> +}
> +
>  #endif /* _SCSI_SCSI_CMND_H */
> 
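
A user-space sketch of the arithmetic in the quoted helper may make the numbers easier to check. The enum values and the names transfer_length() and log2_pow2() are hypothetical stand-ins, not kernel API; only the formula (8 bytes of protection information per sector, skipped for NORMAL, WRITE_STRIP and READ_INSERT) comes from the patch.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's protection-operation codes. */
enum prot_op {
	PROT_NORMAL,
	PROT_READ_INSERT,
	PROT_WRITE_STRIP,
	PROT_READ_STRIP,
	PROT_WRITE_INSERT,
	PROT_READ_PASS,
	PROT_WRITE_PASS,
};

/* Integer log2 of a power-of-two value (stand-in for the kernel's ilog2()). */
static unsigned int log2_pow2(unsigned int v)
{
	unsigned int shift = 0;

	assert(v != 0 && (v & (v - 1)) == 0);	/* sector size must be a power of two */
	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return shift;
}

/* Same arithmetic as the quoted helper: data bytes plus 8 PI bytes per sector. */
static unsigned int transfer_length(unsigned int xfer_len, enum prot_op op,
				    unsigned int sector_size)
{
	switch (op) {
	case PROT_NORMAL:
	case PROT_WRITE_STRIP:
	case PROT_READ_INSERT:
		/* No protection information crosses the wire. */
		return xfer_len;
	default:
		break;
	}
	return xfer_len + (xfer_len >> log2_pow2(sector_size)) * 8;
}
```

For a 4096-byte request on 512-byte sectors this gives 4096 for the three operations listed in the switch and 4096 + 8 * 8 = 4160 for the others.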

I found the issue Christoph is hitting in the other thread.

The problem is that WRITE_SAME requests are set up so that req->__data_len is
the value of the entire request when the setup is completed, but during
the setup process its value changes.

So __data_len could be thousands of bytes but
scsi_out(scsi_cmnd)->length for this case was only returning 512, which
is the sector size. This is because sd_setup_write_same_cmnd does:


rq->__data_len = sdp->sector_size;

scsi_setup_blk_pc_cmnd()

rq->__data_len = nr_bytes;

and scsi_setup_blk_pc_cmnd does scsi_init_io() -> scsi_init_sgtable()
and that does

sdb->length = blk_rq_bytes(req);

At that point, because we set __data_len to the sector size before
calling scsi_setup_blk_pc_cmnd, the sdb length is going to be only 512,
but the final request->__data_len is the total size of the operation.
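
The ordering can be reproduced in a small user-space model. Everything here is a toy: the struct and function names are hypothetical, and only the order of the three assignments follows the description in this message.

```c
#include <assert.h>

/* Toy stand-ins; the comments name the kernel fields they imitate. */
struct toy_request {
	unsigned int data_len;	/* plays the role of rq->__data_len */
};

struct toy_data_buffer {
	unsigned int length;	/* plays the role of sdb->length */
};

/* Plays the role of scsi_init_sgtable(): snapshot the current request size. */
static void toy_init_sgtable(const struct toy_request *rq,
			     struct toy_data_buffer *sdb)
{
	sdb->length = rq->data_len;
}

/* Same order of assignments as described for WRITE SAME setup. */
static void toy_setup_write_same(struct toy_request *rq,
				 struct toy_data_buffer *sdb,
				 unsigned int sector_size)
{
	unsigned int nr_bytes = rq->data_len;	/* size of the whole operation */

	assert(sector_size <= nr_bytes);
	rq->data_len = sector_size;	/* temporarily a single sector */
	toy_init_sgtable(rq, sdb);	/* sdb->length is captured here */
	rq->data_len = nr_bytes;	/* restored after the snapshot */
}
```

Starting from a request with data_len = 8192 and a 512-byte sector, the model ends with sdb.length == 512 while rq.data_len is back at 8192, which is the mismatch described above.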
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] libiscsi, iser: Adjust data_length to include protection information

2014-06-23 Thread Mike Christie
On 06/23/2014 03:59 PM, Christoph Hellwig wrote:
> This patch causes a regression when using the iscsi initiator over
> TCP for me. When mounting a newly created ext4 filesystem I get the
> following BUG: 
> 
> [   31.611803] BUG: unable to handle kernel NULL pointer dereference at 
> 000c
> [   31.613563] IP: [] iscsi_tcp_segment_done+0x2bd/0x380
> [   31.613563] PGD 7a3e4067 PUD 7a45f067 PMD 0 
> [   31.613563] Oops:  [#1] SMP 
> [   31.613563] Modules linked in:
> [   31.613563] CPU: 3 PID: 3739 Comm: kworker/u8:5 Not tainted 3.16.0-rc2 #187
> [   31.613563] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [   31.613563] Workqueue: iscsi_q_2 iscsi_xmitworker
> [   31.613563] task: 88007b33cf10 ti: 88007ad94000 task.ti: 
> 88007ad94000
> [   31.613563] RIP: 0010:[]  [] 
> iscsi_tcp_segment_done+0x2bd/0x380
> [   31.613563] RSP: 0018:88007ad97b38  EFLAGS: 00010246
> [   31.613563] RAX:  RBX: 88007cd67910 RCX: 
> 0200
> [   31.613563] RDX: 2000 RSI:  RDI: 
> 88007cd67910
> [   31.613563] RBP: 88007ad97b98 R08: 0200 R09: 
> 
> [   31.613563] R10:  R11: 0001 R12: 
> 
> [   31.613563] R13: 88007cd67780 R14:  R15: 
> 
> [   31.613563] FS:  () GS:88007fd8() 
> knlGS:
> [   31.613563] CS:  0010 DS:  ES:  CR0: 8005003b
> [   31.613563] CR2: 000c CR3: 7afd9000 CR4: 
> 06e0
> [   31.613563] Stack:
> [   31.613563]  88007ad97b98 81c68fd6 81c68f20 
> 88007c8e37c8
> [   31.613563]  7b33d728 88007dc805b0 88007ad97c58 
> 0200
> [   31.613563]  88007cd67780 88c00040 88007ad97c00 
> 88007cd67910
> [   31.613563] Call Trace:
> [   31.613563]  [] ? inet_sendpage+0xb6/0x130
> [   31.613563]  [] ? inet_dgram_connect+0x80/0x80
> [   31.613563]  [] iscsi_sw_tcp_pdu_xmit+0xe5/0x2e0
> [   31.613563]  [] ? iscsi_sw_tcp_pdu_init+0x1bf/0x390
> [   31.613563]  [] iscsi_tcp_task_xmit+0xa2/0x2b0
> [   31.613563]  [] ? iscsi_xmit_task+0x45/0xd0
> [   31.613563]  [] ? trace_hardirqs_on+0xd/0x10
> [   31.613563]  [] ? __local_bh_enable_ip+0x70/0xd0
> [   31.613563]  [] iscsi_xmit_task+0x59/0xd0
> [   31.613563]  [] iscsi_xmitworker+0x288/0x330
> [   31.613563]  [] process_one_work+0x1c7/0x490
> [   31.613563]  [] ? process_one_work+0x15d/0x490
> [   31.613563]  [] worker_thread+0x119/0x4f0
> [   31.613563]  [] ? trace_hardirqs_on+0xd/0x10
> [   31.613563]  [] ? init_pwq+0x190/0x190
> [   31.613563]  [] kthread+0xdf/0x100
> [   31.613563]  [] ? __init_kthread_worker+0x70/0x70
> [   31.613563]  [] ret_from_fork+0x7c/0xb0
> [   31.613563]  [] ? __init_kthread_worker+0x70/0x70
> [   31.613563] Code: 89 03 31 c0 e9 cc fe ff ff 0f 1f 44 00 00 48 8b 7b
> 30 e8 17 74 de ff 8b 53 10 c7 43 40 00 00 00 00 48 89 43 30 44 89 f6 48
> 89 df <8b> 40 0c 48 c7 03 00 00 00 00 2b 53 14 39 c2 0f 47 d0 89 53 08 
> 
> 
> (gdb) l *(iscsi_tcp_segment_done+0x2bd)
> 0x8197b38d is in iscsi_tcp_segment_done
> (../drivers/scsi/libiscsi_tcp.c:102).
> 97    iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> 98                              struct scatterlist *sg, unsigned int offset)
> 99    {
> 100           segment->sg = sg;
> 101           segment->sg_offset = offset;
> 102           segment->size = min(sg->length - offset,
> 103                               segment->total_size - segment->total_copied);
> 104           segment->data = NULL;
> 105   }
> 106
> 106   
> 


Ok, it looks like scsi_out(scsi_cmnd)->length returns a different value
than scsi_transfer_length() for some commands. iscsi_tcp/libiscsi_tcp
still uses the former for lower level operations, since it was not
converted to support T10 PI. libiscsi uses the latter for higher level
operations, since it was converted to T10 support because iser uses that
module and also has T10 support. We then end up incorrectly thinking some
requests are the wrong size and then hit this. Looking into why exactly
this happens.
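
A toy model of how such a size mismatch can run a segment walk off the end of a scatterlist. The mechanism is an assumption that is consistent with the NULL dereference at line 102 of the quoted listing; it is not confirmed in this thread. The struct toy_sg list and toy_walk() are hypothetical and much simpler than the real scatterlist code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified scatterlist: a NULL-terminated linked list. */
struct toy_sg {
	unsigned int length;
	struct toy_sg *next;
};

/*
 * Walk the list trying to consume total_size bytes.
 * Returns 0 if the list covers total_size, or -1 if the walk would have
 * to step past the last entry (where unchecked code would read a NULL sg).
 */
static int toy_walk(const struct toy_sg *sg, unsigned int total_size)
{
	unsigned int copied = 0;

	while (copied < total_size) {
		unsigned int chunk;

		if (sg == NULL)
			return -1;

		/* Same clamp as min(sg->length, total_size - copied). */
		chunk = sg->length;
		if (chunk > total_size - copied)
			chunk = total_size - copied;

		copied += chunk;
		assert(copied <= total_size);	/* loop invariant */
		sg = sg->next;
	}
	return 0;
}
```

With two 512-byte entries, a total_size of 1024 is covered, while a total_size of 8192 makes the walk run out of entries.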



Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper

2014-06-23 Thread Martin K. Petersen
> "Mike" == Mike Christie  writes:
>> + unsigned int xfer_len = blk_rq_bytes(scmd->request);

Mike> Can you do bidi and dif/dix? 

Nope.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 00/22] Add and use pci_zalloc_consistent

2014-06-23 Thread Julian Calaby
Hi Joe,

On Tue, Jun 24, 2014 at 5:13 AM, Joe Perches  wrote:
> On Mon, 2014-06-23 at 10:25 -0700, Luis R. Rodriguez wrote:
>> On Mon, Jun 23, 2014 at 06:41:28AM -0700, Joe Perches wrote:
>> > Adding the helper reduces object code size as well as overall
>> > source size line count.
>> >
>> > It's also consistent with all the various zalloc mechanisms
>> > in the kernel.
>> >
>> > Done with a simple cocci script and some typing.
>>
>> Awesome, any chance you can paste in the SmPL? Also any chance
>> we can get this added to a make coccicheck so that maintainers
>> moving forward can use that to ensure that no new code is
>> added that uses the old school API?
>
> Not many of these are recent.
>
> Arnd Bergmann reasonably suggested that the pci_alloc_consistent
> api be converted to the more widely used dma_alloc_coherent.
>
> https://lkml.org/lkml/2014/6/23/513
>
>> Shouldn't these drivers just use the normal dma-mapping API now?
>
> and I replied:
>
> https://lkml.org/lkml/2014/6/23/525
>
>> Maybe.  I wouldn't mind.
>> They do seem to have a trivial bit of unnecessary overhead for
>> hwdev == NULL ? NULL : &hwdev->dev
>
> Anyway, here's the little script.
> I'm not sure it's worthwhile to add it though.
>
> $ cat ./scripts/coccinelle/api/alloc/pci_zalloc_consistent.cocci
> ///
> /// Use pci_zalloc_consistent rather than
> /// pci_alloc_consistent followed by memset with 0
> ///
> /// This considers some simple cases that are common and easy to validate
> /// Note in particular that there are no ...s in the rule, so all of the
> /// matched code has to be contiguous
> ///
> /// Blatantly cribbed from: scripts/coccinelle/api/alloc/kzalloc-simple.cocci
>
> @@
> type T, T2;
> expression x;
> expression E1,E2,E3;
> statement S;
> @@
>
> - x = (T)pci_alloc_consistent(E1,E2,E3);
> + x = pci_zalloc_consistent(E1,E2,E3);
>   if ((x==NULL) || ...) S
> - memset((T2)x,0,E2);

I don't know much about SmPL, but wouldn't having that if statement
there reduce your matches?
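
For reference, a user-space sketch of the code shape the quoted rule rewrites. The stub_* allocators are hypothetical stand-ins built on malloc(), not the PCI DMA API. As the rule is written, the NULL check sits between the allocation and the memset as required context, so only code of the "before" shape would be expected to match.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for pci_alloc_consistent(): memory is not zeroed. */
static void *stub_alloc_consistent(size_t size)
{
	assert(size > 0);	/* malloc(0) behaviour is implementation-defined */
	return malloc(size);
}

/* Hypothetical stand-in for pci_zalloc_consistent(): allocate, then zero. */
static void *stub_zalloc_consistent(size_t size)
{
	void *p = stub_alloc_consistent(size);

	if (p)
		memset(p, 0, size);
	return p;
}

/* Shape the rule matches: allocation, NULL check, then memset. */
static unsigned char *ring_alloc_before(size_t size)
{
	unsigned char *ring = stub_alloc_consistent(size);

	if (ring == NULL)
		return NULL;
	memset(ring, 0, size);
	return ring;
}

/* Shape after the rewrite: the memset is folded into the allocator. */
static unsigned char *ring_alloc_after(size_t size)
{
	unsigned char *ring = stub_zalloc_consistent(size);

	if (ring == NULL)
		return NULL;
	return ring;
}

/* Returns 1 if every byte of the buffer is zero, 0 otherwise. */
static int ring_is_zeroed(const unsigned char *ring, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (ring[i] != 0)
			return 0;
	return 1;
}
```

Both functions return zero-filled memory; the second simply drops the separate memset line.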

Thanks,

-- 
Julian Calaby

Email: julian.cal...@gmail.com
Profile: http://www.google.com/profiles/julian.calaby/
.Plan: http://sites.google.com/site/juliancalaby/


Re: PCI/AER: AER in SRIOV environment

2014-06-23 Thread Alex Williamson
On Tue, 2014-06-24 at 01:44 +0300, Yishai Hadas wrote:
> On 6/23/2014 11:12 PM, Don Dutile wrote:
> > On 06/23/2014 03:09 PM, Bjorn Helgaas wrote:
> >> [+cc linux-pci, Don]
> >>
> > Adding Alex Williamson in case he can add more to this conversation...
> >
> >> On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
> >>  wrote:
> >>> Hi Vijay,
> >>> Trying to add AER support for Mellanox NIC in SRIOV environment, while
> >>> evaluating/testing encountered a problem which led me to your
> >>> patch accepted as part of kernel 3.8, commit ID
> >>> "918b4053184c0ca22236e70e299c5343eea35304".
> >>>
> >>> Have some concerns/questions on:
> >>> When working in SRIOV environment VFs may be un-attached, having no 
> >>> driver
> >>> assigned to, or may be attached to Virtual machine to work in some
> >>> pass-through mode.
> >>> Once working in KVM setup there is pci-stub driver which is loaded 
> >>> in the
> >>> HYP/PF for a given attached VF.
> > huh? 'loaded in the hyp/pf?  um, loaded in the host, and a VF is
> > detached from its host driver -- a VF can be used in the host w/o any 
> > virtualization,
> > i.e., that's how guest VM is driving the VF: as if it was used by a 
> > guest (host) OS directly --
> > and attached to pci-stub driver, when assigned to a KVM guest in 
> > pre-VFIO days/ways.
> > If VFIO used, then VF is attached to vfio-pci driver.
> >
> >>>
> >>> I'm using the aer-inject kernel module and its corresponding 
> >>> aer-inject tool
> >>> to simulate an error in the HYP.
> >>> In both cases your commit will cause the AER recovery to fail as 
> >>> there is no
> >>> driver assigned to PF's VFs that supports AER, comparing the code 
> >>> before
> >>> your change.
> >>>
> > Without VFIO, I believe that's correct. There was no AER-to-VF support 
> > pre-VFIO days.
> > I believe with the recent VFIO support,
> > and modifications to KVM, an AER that is associated with an assigned 
> > VF will
> > force the crash/halt of the KVM guest -- can't depend on a guest VF 
> > driver clearing
> > the AER in the hyp/host -- guest isn't privileged enough to clear the 
> > error.
> > So, crashing the guest is the simple option at the moment, to contain 
> > the error.
> > Alex: do I have that (vfio aer default) correct, or is that still 
> > site-under-construction?
>  How about the case that the VF is not attached to a KVM guest and 
> has no driver loaded on host ? in such a case from code review and some 
> testing the recovery will
>  fail as there is no AER aware driver here. What is the expected 
> solution here ?
>  Any special qemu /stuff is needed to activate the VFIO support ? 
> would like to give it a try for a case that VF is attached.

Just use recent QEMU (>=1.6, the newer the better) and it should be
automatic.  Note that the VM won't exit on error, it's stopped with
state RUN_STATE_IO_ERROR to allow the possibility of collecting data.
Thanks,

Alex

> >
> >>> How such cases should work ?  my expectation was that the PF will 
> >>> get the
> >>> error detected message then will recognize whether
> >>> issue is its own or one of its VFs
> > The AER packet will have the tag of the VF in if it was the source of 
> > the error;
> > so the PF will never see it; although one could argue it should be 
> > 'promoted'
> > to the PF if PF/VF needs to clear some state it has wrt the VF (the 
> > SRIOV spec is
> > lacking of info in this space); _but_, VFIO resets the VF (sets FLR 
> > bit) when the
> > device is deassigned and before re-attachment to the host, so that 
> > should clear out
> > any state btwn PF & VF ('should' ... famous last words...).
>  In my test I have used the aer-inject tool simulating an error to 
> the BUS that both PF/VF are residing on, putting the function number to 
> be the PF one, looks like both should be called by the aer driver as part
>  of the pci_walk_bus(). As mentioned I got a call only on the PF and 
> recovery failed as of the VF doesn't include an AER aware driver, once 
> removed the VF recovery succeeded.
>  I believe that packet should include some info about the source of 
> the error isn't it ?
>  In addition, looking at IXGBE upstream source code at 
> ixgbe_error_detected()  looks like there is some code running on the PF 
> that checks whether the source was a VF.
> 
>  By the way: when tried to simulate a VF error using its FN got 
> below error:
>  "Error: Failed to write, Inappropriate ioctl for device", any idea 
> about that error ?
> >
> >>
> >> I'm really not an AER expert, so help me understand this question of
> >> recognizing whether an error is associated with a PF or a VF.
> >>
> >> In terms of hardware, it looks like the device that detects an error
> >> logs some information and sends an Error Message upstream.  The Root
> >> Complex receives the message, captures the source ID from the Error
> >> Message, and may generate an interrupt.  I expect this source ID can
> >> be either a PF or a VF; there's no requ

Re: PCI/AER: AER in SRIOV environment

2014-06-23 Thread Alex Williamson
On Mon, 2014-06-23 at 16:12 -0400, Don Dutile wrote:
> On 06/23/2014 03:09 PM, Bjorn Helgaas wrote:
> > [+cc linux-pci, Don]
> >
> Adding Alex Williamson in case he can add more to this conversation...
> 
> > On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
> >  wrote:
> >> Hi Vijay,
> >> Trying to add AER support for Mellanox NIC in SRIOV environment, while
> >> evaluating/testing encountered a problem which led me to your
> >> patch accepted as part of kernel 3.8, commit ID
> >> "918b4053184c0ca22236e70e299c5343eea35304".
> >>
> >> Have some concerns/questions on:
> >> When working in SRIOV environment VFs may be un-attached, having no driver
> >> assigned to, or may be attached to Virtual machine to work in some
> >> pass-through mode.
> >> Once working in KVM setup there is pci-stub driver which is loaded in the
> >> HYP/PF for a given attached VF.
> huh? 'loaded in the hyp/pf?  um, loaded in the host, and a VF is
> detached from its host driver -- a VF can be used in the host w/o any 
> virtualization,
> i.e., that's how guest VM is driving the VF: as if it was used by a guest 
> (host) OS directly --
> and attached to pci-stub driver, when assigned to a KVM guest in pre-VFIO 
> days/ways.
> If VFIO used, then VF is attached to vfio-pci driver.
> 
> >>
> >> I'm using the aer-inject kernel module and its corresponding aer-inject 
> >> tool
> >> to simulate an error in the HYP.
> >> In both cases your commit will cause the AER recovery to fail as there is 
> >> no
> >> driver assigned to PF's VFs that supports AER, comparing the code before
> >> your change.
> >>
> Without VFIO, I believe that's correct. There was no AER-to-VF support 
> pre-VFIO days.
> I believe with the recent VFIO support,
> and modifications to KVM, an AER that is associated with an assigned VF will
> force the crash/halt of the KVM guest -- can't depend on a guest VF driver 
> clearing
> the AER in the hyp/host -- guest isn't privileged enough to clear the error.
> So, crashing the guest is the simple option at the moment, to contain the 
> error.
> Alex: do I have that (vfio aer default) correct, or is that still 
> site-under-construction?

Yep, any kind of recovery is TBD, we just send an eventfd signal that an
error occurred and QEMU handles it by stopping the guest.  Not sure I
can add much more to the conversation, but this is exactly the sort of
thing that makes legacy kvm device assignment and pci-stub a bad design.
Thanks,

Alex

> >> How such cases should work ?  my expectation was that the PF will get the
> >> error detected message then will recognize whether
> >> issue is its own or one of its VFs
> The AER packet will have the tag of the VF in if it was the source of the 
> error;
> so the PF will never see it; although one could argue it should be 'promoted'
> to the PF if PF/VF needs to clear some state it has wrt the VF (the SRIOV 
> spec is
> lacking of info in this space); _but_, VFIO resets the VF (sets FLR bit) when 
> the
> device is deassigned and before re-attachment to the host, so that should 
> clear out
> any state btwn PF & VF ('should' ... famous last words...).
> 
> >
> > I'm really not an AER expert, so help me understand this question of
> > recognizing whether an error is associated with a PF or a VF.
> >
> > In terms of hardware, it looks like the device that detects an error
> > logs some information and sends an Error Message upstream.  The Root
> > Complex receives the message, captures the source ID from the Error
> > Message, and may generate an interrupt.  I expect this source ID can
> > be either a PF or a VF; there's no requirement that a VF error must be
> > reported as though it's from the PF, is there?
> >
> >> and work accordingly, in current code
> >> looks like recovery failed as part of "voting" once there is no AER handler
> >> assigned to the VFs.
> >
> > The commit you mentioned has to do with PCI_ERS_RESULT_NO_AER_DRIVER.
> > We use pci_walk_bus() to figure out whether all the devices in a
> > subtree have a driver.  What subtree is involved here?  I would expect
> > the VFs to be siblings of the PF, not children of it, so I'm not sure
> > where things went wrong.
> Well, VFs could be on virtual busses (ARI turned on), so not necessarily a
> sibling to PF ... and then we have the problem in PCI code of not being able
> to traverse these virtual busses (in some cases; not sure if pci_walk_bus(),
> which is going down the tree vs up the tree, has any problems here w/VFs on
> virtual busses).
> 
> >
> > Can you collect "lspci -vvv" output and maybe add some debug so we can
> > see exactly where the error is detected and what devices we're looking
> > at to conclude that one of them doesn't have a driver?
> >
> > Bjorn
> >
> 




Re: PCI/AER: AER in SRIOV environment

2014-06-23 Thread Yishai Hadas

On 6/23/2014 11:12 PM, Don Dutile wrote:

On 06/23/2014 03:09 PM, Bjorn Helgaas wrote:

[+cc linux-pci, Don]


Adding Alex Williamson in case he can add more to this conversation...


On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
 wrote:

Hi Vijay,
Trying to add AER support for Mellanox NIC in SRIOV environment, while
evaluating/testing encountered a problem which led me to your
patch accepted as part of kernel 3.8, commit ID
"918b4053184c0ca22236e70e299c5343eea35304".

Have some concerns/questions on:
When working in SRIOV environment VFs may be un-attached, having no 
driver

assigned to, or may be attached to Virtual machine to work in some
pass-through mode.
Once working in KVM setup there is pci-stub driver which is loaded 
in the

HYP/PF for a given attached VF.

huh? 'loaded in the hyp/pf?  um, loaded in the host, and a VF is
detached from its host driver -- a VF can be used in the host w/o any 
virtualization,
i.e., that's how guest VM is driving the VF: as if it was used by a 
guest (host) OS directly --
and attached to pci-stub driver, when assigned to a KVM guest in 
pre-VFIO days/ways.

If VFIO used, then VF is attached to vfio-pci driver.



I'm using the aer-inject kernel module and its corresponding 
aer-inject tool

to simulate an error in the HYP.
In both cases your commit will cause the AER recovery to fail as 
there is no
driver assigned to PF's VFs that supports AER, comparing the code 
before

your change.

Without VFIO, I believe that's correct. There was no AER-to-VF support 
pre-VFIO days.

I believe with the recent VFIO support,
and modifications to KVM, an AER that is associated with an assigned 
VF will
force the crash/halt of the KVM guest -- can't depend on a guest VF 
driver clearing
the AER in the hyp/host -- guest isn't privileged enough to clear the 
error.
So, crashing the guest is the simple option at the moment, to contain 
the error.
Alex: do I have that (vfio aer default) correct, or is that still 
site-under-construction?
How about the case that the VF is not attached to a KVM guest and 
has no driver loaded on host ? in such a case from code review and some 
testing the recovery will
fail as there is no AER aware driver here. What is the expected 
solution here ?
Any special qemu /stuff is needed to activate the VFIO support ? 
would like to give it a try for a case that VF is attached.


How such cases should work ?  my expectation was that the PF will 
get the

error detected message then will recognize whether
issue is its own or one of its VFs
The AER packet will have the tag of the VF in if it was the source of 
the error;
so the PF will never see it; although one could argue it should be 
'promoted'
to the PF if PF/VF needs to clear some state it has wrt the VF (the 
SRIOV spec is
lacking of info in this space); _but_, VFIO resets the VF (sets FLR 
bit) when the
device is deassigned and before re-attachment to the host, so that 
should clear out

any state btwn PF & VF ('should' ... famous last words...).
In my test I have used the aer-inject tool simulating an error to 
the BUS that both PF/VF are residing on, putting the function number to 
be the PF one, looks like both should be called by the aer driver as part
of the pci_walk_bus(). As mentioned I got a call only on the PF and 
recovery failed as of the VF doesn't include an AER aware driver, once 
removed the VF recovery succeeded.
I believe that packet should include some info about the source of 
the error isn't it ?
In addition, looking at IXGBE upstream source code at 
ixgbe_error_detected()  looks like there is some code running on the PF 
that checks whether the source was a VF.


By the way: when tried to simulate a VF error using its FN got 
below error:
"Error: Failed to write, Inappropriate ioctl for device", any idea 
about that error ?




I'm really not an AER expert, so help me understand this question of
recognizing whether an error is associated with a PF or a VF.

In terms of hardware, it looks like the device that detects an error
logs some information and sends an Error Message upstream.  The Root
Complex receives the message, captures the source ID from the Error
Message, and may generate an interrupt.  I expect this source ID can
be either a PF or a VF; there's no requirement that a VF error must be
reported as though it's from the PF, is there?


and work accordingly, in current code
looks like recovery failed as part of "voting" once there is no AER 
handler

assigned to the VFs.


The commit you mentioned has to do with PCI_ERS_RESULT_NO_AER_DRIVER.
We use pci_walk_bus() to figure out whether all the devices in a
subtree have a driver.  What subtree is involved here?  I would expect
the VFs to be siblings of the PF, not children of it, so I'm not sure
where things went wrong.
Well, VFs could be on virtual busses (ARI turned on), so not 
necessarily a
sibling to PF ... and then we have the problem in PCI code of not 
being able
to traverse th

[PATCH v1 11/13] xprtrdma: Clean up rpcrdma_ep_disconnect()

2014-06-23 Thread Chuck Lever
The return code is used only for dprintk's that are already
redundant.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |2 +-
 net/sunrpc/xprtrdma/verbs.c |   13 +++--
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +-
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index f6d280b..2faac49 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -414,7 +414,7 @@ xprt_rdma_close(struct rpc_xprt *xprt)
if (r_xprt->rx_ep.rep_connected > 0)
xprt->reestablish_timeout = 0;
xprt_disconnect_done(xprt);
-   (void) rpcrdma_ep_disconnect(&r_xprt->rx_ep, &r_xprt->rx_ia);
+   rpcrdma_ep_disconnect(&r_xprt->rx_ep, &r_xprt->rx_ia);
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 52f57f7..b6c52c7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -838,10 +838,7 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct 
rpcrdma_ia *ia)
cancel_delayed_work_sync(&ep->rep_connect_worker);
 
if (ia->ri_id->qp) {
-   rc = rpcrdma_ep_disconnect(ep, ia);
-   if (rc)
-   dprintk("RPC:   %s: rpcrdma_ep_disconnect"
-   " returned %i\n", __func__, rc);
+   rpcrdma_ep_disconnect(ep, ia);
rdma_destroy_qp(ia->ri_id);
ia->ri_id->qp = NULL;
}
@@ -879,10 +876,7 @@ rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct 
rpcrdma_ia *ia)
struct rpcrdma_xprt *xprt;
 retry:
dprintk("RPC:   %s: reconnecting...\n", __func__);
-   rc = rpcrdma_ep_disconnect(ep, ia);
-   if (rc && rc != -ENOTCONN)
-   dprintk("RPC:   %s: rpcrdma_ep_disconnect"
-   " status %i\n", __func__, rc);
+   rpcrdma_ep_disconnect(ep, ia);
 
xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
id = rpcrdma_create_id(xprt, ia,
@@ -988,7 +982,7 @@ out:
  * This call is not reentrant, and must not be made in parallel
  * on the same endpoint.
  */
-int
+void
 rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
int rc;
@@ -1004,7 +998,6 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct 
rpcrdma_ia *ia)
dprintk("RPC:   %s: rdma_disconnect %i\n", __func__, rc);
ep->rep_connected = rc;
}
-   return rc;
 }
 
 /*
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 7a140fe..4f7de2a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -343,7 +343,7 @@ int rpcrdma_ep_create(struct rpcrdma_ep *, struct 
rpcrdma_ia *,
struct rpcrdma_create_data_internal *);
 void rpcrdma_ep_destroy(struct rpcrdma_ep *, struct rpcrdma_ia *);
 int rpcrdma_ep_connect(struct rpcrdma_ep *, struct rpcrdma_ia *);
-int rpcrdma_ep_disconnect(struct rpcrdma_ep *, struct rpcrdma_ia *);
+void rpcrdma_ep_disconnect(struct rpcrdma_ep *, struct rpcrdma_ia *);
 
 int rpcrdma_ep_post(struct rpcrdma_ia *, struct rpcrdma_ep *,
struct rpcrdma_req *);



[PATCH v1 12/13] xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro

2014-06-23 Thread Chuck Lever
Clean up.

RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between
RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode.  Since
RPCRDMA_REGISTER has been removed, there's no need for the extra
conditional compilation.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprtrdma.h |2 --
 net/sunrpc/xprtrdma/verbs.c |   13 -
 2 files changed, 15 deletions(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index c2f04e1..64a0a0a 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -62,8 +62,6 @@
 #define RPCRDMA_INLINE_PAD_THRESH  (512)/* payload threshold to pad (bytes) */
 
 /* memory registration strategies */
-#define RPCRDMA_PERSISTENT_REGISTRATION (1)
-
 enum rpcrdma_memreg {
RPCRDMA_BOUNCEBUFFERS = 0,
RPCRDMA_REGISTER,
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b6c52c7..ec98e48 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -569,12 +569,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr 
*addr, int memreg)
if (!ia->ri_id->device->alloc_fmr) {
dprintk("RPC:   %s: MTHCAFMR registration "
"not supported by HCA\n", __func__);
-#if RPCRDMA_PERSISTENT_REGISTRATION
memreg = RPCRDMA_ALLPHYSICAL;
-#else
-   rc = -ENOMEM;
-   goto out2;
-#endif
}
}
 
@@ -589,20 +584,16 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct 
sockaddr *addr, int memreg)
switch (memreg) {
case RPCRDMA_FRMR:
break;
-#if RPCRDMA_PERSISTENT_REGISTRATION
case RPCRDMA_ALLPHYSICAL:
mem_priv = IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_REMOTE_READ;
goto register_setup;
-#endif
case RPCRDMA_MTHCAFMR:
if (ia->ri_have_dma_lkey)
break;
mem_priv = IB_ACCESS_LOCAL_WRITE;
-#if RPCRDMA_PERSISTENT_REGISTRATION
register_setup:
-#endif
ia->ri_bind_mem = ib_get_dma_mr(ia->ri_pd, mem_priv);
if (IS_ERR(ia->ri_bind_mem)) {
printk(KERN_ALERT "%s: ib_get_dma_mr for "
@@ -1770,7 +1761,6 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
 
switch (ia->ri_memreg_strategy) {
 
-#if RPCRDMA_PERSISTENT_REGISTRATION
case RPCRDMA_ALLPHYSICAL:
rpcrdma_map_one(ia, seg, writing);
seg->mr_rkey = ia->ri_bind_mem->rkey;
@@ -1778,7 +1768,6 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
seg->mr_nsegs = 1;
nsegs = 1;
break;
-#endif
 
/* Registration using frmr registration */
case RPCRDMA_FRMR:
@@ -1808,11 +1797,9 @@ rpcrdma_deregister_external(struct rpcrdma_mr_seg *seg,
 
switch (ia->ri_memreg_strategy) {
 
-#if RPCRDMA_PERSISTENT_REGISTRATION
case RPCRDMA_ALLPHYSICAL:
rpcrdma_unmap_one(ia, seg);
break;
-#endif
 
case RPCRDMA_FRMR:
rc = rpcrdma_deregister_frmr_external(seg, ia, r_xprt);



[PATCH v1 13/13] xprtrdma: Handle additional connection events

2014-06-23 Thread Chuck Lever
Commit 38ca83a5 added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status
is relevant only for consumers that re-use their QPs on new
connections. xprtrdma creates a fresh QP on reconnection, so that
event should be explicitly ignored.

Squelch the alarming "unexpected CM event" message.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ec98e48..dbd5f22 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -334,8 +334,16 @@ static const char * const conn[] = {
"rejected",
"established",
"disconnected",
-   "device removal"
+   "device removal",
+   "multicast join",
+   "multicast error",
+   "address change",
+   "timewait exit",
 };
+
+#define CONNECTION_MSG(status) \
+   ((status) < ARRAY_SIZE(conn) ?  \
+   conn[(status)] : "unrecognized connection error")
 #endif
 
 static int
@@ -393,13 +401,10 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct 
rdma_cm_event *event)
case RDMA_CM_EVENT_DEVICE_REMOVAL:
connstate = -ENODEV;
 connected:
-   dprintk("RPC:   %s: %s: %pI4:%u (ep 0x%p event 0x%x)\n",
-   __func__,
-   (event->event <= 11) ? conn[event->event] :
-   "unknown connection error",
-   &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port),
-   ep, event->event);
+   dprintk("RPC:   %s: %pI4:%u (ep 0x%p): %s\n",
+   __func__, &addr->sin_addr.s_addr,
+   ntohs(addr->sin_port), ep,
+   CONNECTION_MSG(event->event));
atomic_set(&rpcx_to_rdmax(ep->rep_xprt)->rx_buf.rb_credits, 1);
dprintk("RPC:   %s: %sconnected\n",
__func__, connstate > 0 ? "" : "dis");
@@ -408,8 +413,10 @@ connected:
wake_up_all(&ep->rep_connect_wait);
break;
default:
-   dprintk("RPC:   %s: unexpected CM event %d\n",
-   __func__, event->event);
+   dprintk("RPC:   %s: %pI4:%u (ep 0x%p): %s\n",
+   __func__, &addr->sin_addr.s_addr,
+   ntohs(addr->sin_port), ep,
+   CONNECTION_MSG(event->event));
break;
}
 



[PATCH v1 10/13] xprtrdma: Release FRMR segment buffers during LOCAL_INV completion

2014-06-23 Thread Chuck Lever
FRMR uses a LOCAL_INV Work Request, which is asynchronous, to
deregister segment buffers.  Other registration strategies use
synchronous deregistration mechanisms (like ib_unmap_fmr()).

For a synchronous deregistration mechanism, it makes sense for
xprt_rdma_free() to put segment buffers back into the buffer pool
immediately once rpcrdma_deregister_external() returns.

This is currently also what FRMR is doing. It is releasing segment
buffers just after the LOCAL_INV WR is posted.

But segment buffers need to be put back after the LOCAL_INV WR
_completes_ (or flushes). Otherwise, rpcrdma_buffer_get() can then
assign these segment buffers to another RPC task while they are
still "in use" by the hardware.

The result of re-using an FRMR too quickly is that its rkey
no longer matches the rkey that was registered with the provider.
This results in FAST_REG_MR or LOCAL_INV Work Requests completing
with IB_WC_MW_BIND_ERR, and the FRMR, and thus the transport,
becomes unusable.
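
The idea of holding a buffer back until every outstanding Work Request has completed can be sketched with a plain reference count. This is a single-threaded userspace illustration only: struct mw, mw_init(), mw_get() and mw_put() are hypothetical names, and the patch itself uses a kref and takes rb_lock around the final put.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory-window descriptor. */
struct mw {
	int refcount;	/* the owning request plus any in-flight WRs */
	int in_pool;	/* 1 once the descriptor is back on the free list */
};

/* Called when a request takes the descriptor from the pool. */
static void mw_init(struct mw *mw)
{
	mw->refcount = 1;
	mw->in_pool = 0;
}

/* Called before posting each asynchronous work request. */
static void mw_get(struct mw *mw)
{
	assert(mw->refcount > 0);
	mw->refcount++;
}

/* Called by the request when it is done, and by each completion.
 * Only the last caller returns the descriptor to the pool. */
static void mw_put(struct mw *mw)
{
	assert(mw->refcount > 0);
	if (--mw->refcount == 0)
		mw->in_pool = 1;
}
```

In this sketch, if the request drops its reference while an invalidate is still in flight, in_pool stays 0 until the completion handler calls mw_put().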

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   44 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f24f0bf..52f57f7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -62,6 +62,8 @@
 #endif
 
 static void rpcrdma_decrement_frmr_rkey(struct rpcrdma_mw *);
+static void rpcrdma_get_mw(struct rpcrdma_mw *);
+static void rpcrdma_put_mw(struct rpcrdma_mw *);
 
 /*
  * internal functions
@@ -167,6 +169,7 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
if (fastreg)
rpcrdma_decrement_frmr_rkey(mw);
}
+   rpcrdma_put_mw(mw);
 }
 
 static int
@@ -1034,7 +1037,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
len += cdata->padding;
switch (ia->ri_memreg_strategy) {
case RPCRDMA_FRMR:
-   len += buf->rb_max_requests * RPCRDMA_MAX_SEGS *
+   len += (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS *
sizeof(struct rpcrdma_mw);
break;
case RPCRDMA_MTHCAFMR:
@@ -1076,7 +1079,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
r = (struct rpcrdma_mw *)p;
switch (ia->ri_memreg_strategy) {
case RPCRDMA_FRMR:
-   for (i = buf->rb_max_requests * RPCRDMA_MAX_SEGS; i; i--) {
+   for (i = (buf->rb_max_requests+1) * RPCRDMA_MAX_SEGS; i; i--) {
r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(ia->ri_pd,
ia->ri_max_frmr_depth);
if (IS_ERR(r->r.frmr.fr_mr)) {
@@ -1252,12 +1255,36 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 }
 
 static void
-rpcrdma_put_mw_locked(struct rpcrdma_mw *mw)
+rpcrdma_free_mw(struct kref *kref)
 {
+   struct rpcrdma_mw *mw = container_of(kref, struct rpcrdma_mw, mw_ref);
list_add_tail(&mw->mw_list, &mw->mw_pool->rb_mws);
 }
 
 static void
+rpcrdma_put_mw_locked(struct rpcrdma_mw *mw)
+{
+   kref_put(&mw->mw_ref, rpcrdma_free_mw);
+}
+
+static void
+rpcrdma_get_mw(struct rpcrdma_mw *mw)
+{
+   kref_get(&mw->mw_ref);
+}
+
+static void
+rpcrdma_put_mw(struct rpcrdma_mw *mw)
+{
+   struct rpcrdma_buffer *buffers = mw->mw_pool;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buffers->rb_lock, flags);
+   rpcrdma_put_mw_locked(mw);
+   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+}
+
+static void
 rpcrdma_buffer_put_mw(struct rpcrdma_mw **mw)
 {
rpcrdma_put_mw_locked(*mw);
@@ -1304,6 +1331,7 @@ rpcrdma_buffer_get_mws(struct rpcrdma_req *req, struct rpcrdma_buffer *buffers)
r = list_entry(buffers->rb_mws.next,
struct rpcrdma_mw, mw_list);
list_del(&r->mw_list);
+   kref_init(&r->mw_ref);
r->mw_pool = buffers;
req->rl_segments[i].mr_chunk.rl_mw = r;
}
@@ -1583,6 +1611,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
dprintk("RPC:   %s: Using frmr %p to map %d segments\n",
__func__, seg1->mr_chunk.rl_mw, i);
 
+   rpcrdma_get_mw(seg1->mr_chunk.rl_mw);
if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.fr_state == FRMR_IS_VALID)) {
dprintk("RPC:   %s: frmr %x left valid, posting invalidate.\n",
__func__,
@@ -1595,6 +1624,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
invalidate_wr.send_flags = IB_SEND_SIGNALED;
invalidate_wr.ex.invalidate_rkey =
seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
+   rpcrdma_get_mw(seg1->mr_chunk.rl_mw);
DECR_CQCOUNT(&r_xprt->rx_ep);
post_wr = &invalidate_wr;
} else
@@ -1638,6 +1668,9 @@ rpc

[PATCH v1 09/13] xprtrdma: Refactor rpcrdma_buffer_put()

2014-06-23 Thread Chuck Lever
Split out the code that manages the rb_mws list.

A little extra error checking is introduced in the code path that
grabs MWs for the next RPC request. If rb_mws were ever to become
empty, the list_entry() would cause a NULL pointer dereference.

Instead, now rpcrdma_buffer_get() returns NULL, which causes
call_allocate() to delay and try again.
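
A simplified userspace sketch of the "check for an empty list, undo, and report failure" pattern follows. It is not the patch's code: struct pool, pool_get_all(), MAX_SEGS and POOL_CAP are hypothetical, and the pool is an array-backed stack rather than a list_head.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGS 4	/* hypothetical stand-in for RPCRDMA_MAX_SEGS */
#define POOL_CAP 16	/* hypothetical pool capacity */

struct pool {
	int free_ids[POOL_CAP];	/* stack of free item ids */
	size_t nfree;		/* number of valid entries in free_ids */
};

/* Fill out[0..MAX_SEGS-1] from the pool. If the pool runs dry partway
 * through, hand back everything taken so far and return -1 so the
 * caller can delay and try again. Returns 0 on success. */
static int pool_get_all(struct pool *p, int out[MAX_SEGS])
{
	size_t taken = 0;

	assert(p->nfree <= POOL_CAP);
	while (taken < MAX_SEGS) {
		if (p->nfree == 0)
			goto out_empty;
		out[taken++] = p->free_ids[--p->nfree];
	}
	return 0;

out_empty:
	while (taken > 0)
		p->free_ids[p->nfree++] = out[--taken];
	return -1;
}
```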

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |  105 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |1 
 2 files changed, 74 insertions(+), 32 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3efc007..f24f0bf 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1251,6 +1251,69 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
 }
 
+static void
+rpcrdma_put_mw_locked(struct rpcrdma_mw *mw)
+{
+   list_add_tail(&mw->mw_list, &mw->mw_pool->rb_mws);
+}
+
+static void
+rpcrdma_buffer_put_mw(struct rpcrdma_mw **mw)
+{
+   rpcrdma_put_mw_locked(*mw);
+   *mw = NULL;
+}
+
+/* Cycle mw's back in reverse order, and "spin" them.
+ * This delays and scrambles reuse as much as possible.
+ */
+static void
+rpcrdma_buffer_put_mws(struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg1 = req->rl_segments;
+   struct rpcrdma_mr_seg *seg = seg1;
+   int i;
+
+   for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
+   rpcrdma_buffer_put_mw(&seg->mr_chunk.rl_mw);
+   rpcrdma_buffer_put_mw(&seg1->mr_chunk.rl_mw);
+}
+
+static void
+rpcrdma_send_buffer_put(struct rpcrdma_req *req, struct rpcrdma_buffer *buffers)
+{
+   buffers->rb_send_bufs[--buffers->rb_send_index] = req;
+   req->rl_niovs = 0;
+   if (req->rl_reply) {
+   buffers->rb_recv_bufs[--buffers->rb_recv_index] = req->rl_reply;
+   req->rl_reply->rr_func = NULL;
+   req->rl_reply = NULL;
+   }
+}
+
+static struct rpcrdma_req *
+rpcrdma_buffer_get_mws(struct rpcrdma_req *req, struct rpcrdma_buffer *buffers)
+{
+   struct rpcrdma_mw *r;
+   int i;
+
+   for (i = RPCRDMA_MAX_SEGS - 1; i >= 0; i--) {
+   if (list_empty(&buffers->rb_mws))
+   goto out_empty;
+
+   r = list_entry(buffers->rb_mws.next,
+   struct rpcrdma_mw, mw_list);
+   list_del(&r->mw_list);
+   r->mw_pool = buffers;
+   req->rl_segments[i].mr_chunk.rl_mw = r;
+   }
+   return req;
+out_empty:
+   rpcrdma_send_buffer_put(req, buffers);
+   rpcrdma_buffer_put_mws(req);
+   return NULL;
+}
+
 /*
  * Get a set of request/reply buffers.
  *
@@ -1263,10 +1326,9 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 struct rpcrdma_req *
 rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
 {
+   struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
struct rpcrdma_req *req;
unsigned long flags;
-   int i;
-   struct rpcrdma_mw *r;
 
spin_lock_irqsave(&buffers->rb_lock, flags);
if (buffers->rb_send_index == buffers->rb_max_requests) {
@@ -1286,14 +1348,13 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
buffers->rb_recv_bufs[buffers->rb_recv_index++] = NULL;
}
buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
-   if (!list_empty(&buffers->rb_mws)) {
-   i = RPCRDMA_MAX_SEGS - 1;
-   do {
-   r = list_entry(buffers->rb_mws.next,
-   struct rpcrdma_mw, mw_list);
-   list_del(&r->mw_list);
-   req->rl_segments[i].mr_chunk.rl_mw = r;
-   } while (--i >= 0);
+   switch (ia->ri_memreg_strategy) {
+   case RPCRDMA_FRMR:
+   case RPCRDMA_MTHCAFMR:
+   req = rpcrdma_buffer_get_mws(req, buffers);
+   break;
+   default:
+   break;
}
spin_unlock_irqrestore(&buffers->rb_lock, flags);
return req;
@@ -1308,34 +1369,14 @@ rpcrdma_buffer_put(struct rpcrdma_req *req)
 {
struct rpcrdma_buffer *buffers = req->rl_buffer;
struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
-   int i;
unsigned long flags;
 
spin_lock_irqsave(&buffers->rb_lock, flags);
-   buffers->rb_send_bufs[--buffers->rb_send_index] = req;
-   req->rl_niovs = 0;
-   if (req->rl_reply) {
-   buffers->rb_recv_bufs[--buffers->rb_recv_index] = req->rl_reply;
-   req->rl_reply->rr_func = NULL;
-   req->rl_reply = NULL;
-   }
+   rpcrdma_send_buffer_put(req, buffers);
switch (ia->ri_memreg_strategy) {
case RPCRDMA_FRMR:
case RPCRDMA_MTHCAFMR:
-   /*
-* Cycle mw's back in reverse order, and "spin" them.
-* This delays and scrambles reuse as much as possible.
-*/
-   i = 1;
-

[PATCH v1 03/13] xprtrdma: Limit data payload size for ALLPHYSICAL

2014-06-23 Thread Chuck Lever
When the client uses physical memory registration, each page in the
payload gets its own array entry in the RPC/RDMA header's chunk list.

Therefore, don't advertise a maximum payload size that would require
more array entries than can fit in the RPC buffer where RPC/RDMA
headers are built.
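
The arithmetic behind this limit can be worked through in a small userspace sketch. All four parameters are illustrative assumptions supplied by the caller; physical_max_payload() is a hypothetical helper, not the function added by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* inline_size - bytes available for building the RPC/RDMA header
 * hdr_min     - bytes of fixed header that precede the chunk list
 * entry_size  - bytes consumed by one chunk list entry
 * page_shift  - log2 of the page size
 *
 * One entry is needed per page, so the payload limit is the number of
 * entries that fit, multiplied by the page size. */
static size_t physical_max_payload(size_t inline_size, size_t hdr_min,
				   size_t entry_size, unsigned int page_shift)
{
	size_t pages;

	assert(entry_size > 0 && inline_size >= hdr_min);
	pages = (inline_size - hdr_min) / entry_size;
	return pages << page_shift;
}
```

For instance, assuming a 1024-byte inline buffer, a 28-byte minimum header, 16-byte entries and 4 KiB pages, 62 entries fit, which gives a limit of 253952 bytes.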

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |4 +++-
 net/sunrpc/xprtrdma/verbs.c |   41 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |1 +
 3 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 66f91f0..4185102 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -296,7 +296,6 @@ xprt_setup_rdma(struct xprt_create *args)
 
xprt->resvport = 0; /* privileged port not needed */
xprt->tsh_size = 0; /* RPC-RDMA handles framing */
-   xprt->max_payload = RPCRDMA_MAX_DATA_SEGS * PAGE_SIZE;
xprt->ops = &xprt_rdma_procs;
 
/*
@@ -382,6 +381,9 @@ xprt_setup_rdma(struct xprt_create *args)
new_ep->rep_xprt = xprt;
 
xprt_rdma_format_addresses(xprt);
+   xprt->max_payload = rpcrdma_max_payload(new_xprt);
+   dprintk("RPC:   %s: transport data payload maximum: %zu bytes\n",
+   __func__, xprt->max_payload);
 
if (!try_module_get(THIS_MODULE))
goto out4;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f70b8ad..3c7f904 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1819,3 +1819,44 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
rc);
return rc;
 }
+
+/* Physical mapping means one Read/Write list entry per-page.
+ * All list entries must fit within an inline buffer
+ *
+ * NB: The server must return a Write list for NFS READ,
+ * which has the same constraint. Factor in the inline
+ * rsize as well.
+ */
+static size_t
+rpcrdma_physical_max_payload(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
+   unsigned int inline_size, pages;
+
+   inline_size = min_t(unsigned int,
+   cdata->inline_wsize, cdata->inline_rsize) -
+   RPCRDMA_HDRLEN_MIN;
+   pages = inline_size / sizeof(struct rpcrdma_segment);
+   return pages << PAGE_SHIFT;
+}
+
+static size_t
+rpcrdma_mr_max_payload(struct rpcrdma_xprt *r_xprt)
+{
+   return RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT;
+}
+
+size_t
+rpcrdma_max_payload(struct rpcrdma_xprt *r_xprt)
+{
+   size_t result;
+
+   switch (r_xprt->rx_ia.ri_memreg_strategy) {
+   case RPCRDMA_ALLPHYSICAL:
+   result = rpcrdma_physical_max_payload(r_xprt);
+   break;
+   default:
+   result = rpcrdma_mr_max_payload(r_xprt);
+   }
+   return result;
+}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 97ca516..f3d86b2 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -348,6 +348,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *);
  * RPC/RDMA protocol calls - xprtrdma/rpc_rdma.c
  */
 int rpcrdma_marshal_req(struct rpc_rqst *);
+size_t rpcrdma_max_payload(struct rpcrdma_xprt *);
 
 /* Temporary NFS request map cache. Created in svc_rdma.c  */
 extern struct kmem_cache *svc_rdma_map_cachep;

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 07/13] xprtrdma: Encode Work Request opcode in wc->wr_id

2014-06-23 Thread Chuck Lever
The wc->opcode field is unreliable when a completion fails.
Up until now, the completion handler has ignored unsuccessful
completions, so that didn't matter to xprtrdma.

In a subsequent patch, however, the send CQ handler will need
to know which Work Request opcode is completing, even for
error completions.

xprtrdma posts three Work Request opcodes via the send queue:
SEND, FAST_REG_MR, and LOCAL_INV:

For SEND, wc->wr_id is zero. Those completions are ignored.

The other two plant a pointer to an rpcrdma_mw in wc->wr_id. Make
the low-order bit indicate which of FAST_REG_MR or LOCAL_INV is
being done.
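
The low-order-bit encoding can be sketched in userspace as follows. The names wrid_encode(), wrid_decode() and struct mw are hypothetical, and the sketch assumes the descriptor is at least 2-byte aligned so that bit 0 of its address is always clear.

```c
#include <assert.h>
#include <stdint.h>

#define BIT_FASTREG 0	/* bit number used as the opcode flag */

/* Hypothetical descriptor; its int member gives it an alignment that
 * leaves bit 0 of any valid address clear. */
struct mw {
	int state;
};

/* Pack the descriptor address and the opcode flag into one 64-bit id. */
static uint64_t wrid_encode(struct mw *mw, int fastreg)
{
	uintptr_t p = (uintptr_t)mw;

	assert((p & 1) == 0);	/* bit 0 must be free for the flag */
	return (uint64_t)p | ((uint64_t)(fastreg != 0) << BIT_FASTREG);
}

/* Recover the flag and the original pointer from a completion's id. */
static struct mw *wrid_decode(uint64_t wrid, int *fastreg)
{
	*fastreg = (int)((wrid >> BIT_FASTREG) & 1);
	return (struct mw *)(uintptr_t)(wrid & ~((uint64_t)1 << BIT_FASTREG));
}
```

Because the flag travels inside the id rather than in the completion's opcode field, it remains readable even when the completion reports an error.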

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   19 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e8ed81c..cef67fd 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -145,20 +145,22 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
 static void
 rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 {
-   struct rpcrdma_mw *frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   unsigned long wrid = wc->wr_id;
+   struct rpcrdma_mw *mw;
+   int fastreg;
 
-   dprintk("RPC:   %s: frmr %p status %X opcode %d\n",
-   __func__, frmr, wc->status, wc->opcode);
+   dprintk("RPC:   %s: wr_id %lx status %X opcode %d\n",
+   __func__, wrid, wc->status, wc->opcode);
 
-   if (wc->wr_id == 0ULL)
+   if (wrid == 0)
return;
if (wc->status != IB_WC_SUCCESS)
return;
 
-   if (wc->opcode == IB_WC_FAST_REG_MR)
-   frmr->r.frmr.fr_state = FRMR_IS_VALID;
-   else if (wc->opcode == IB_WC_LOCAL_INV)
-   frmr->r.frmr.fr_state = FRMR_IS_INVALID;
+   fastreg = test_and_clear_bit(RPCRDMA_BIT_FASTREG, &wrid);
+   mw = (struct rpcrdma_mw *)wrid;
+
+   mw->r.frmr.fr_state = fastreg ? FRMR_IS_VALID : FRMR_IS_INVALID;
 }
 
 static int
@@ -1538,6 +1540,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
/* Prepare FRMR WR */
memset(&frmr_wr, 0, sizeof frmr_wr);
frmr_wr.wr_id = (unsigned long)(void *)seg1->mr_chunk.rl_mw;
+   frmr_wr.wr_id |= (u64)1 << RPCRDMA_BIT_FASTREG;
frmr_wr.opcode = IB_WR_FAST_REG_MR;
frmr_wr.send_flags = IB_SEND_SIGNALED;
frmr_wr.wr.fast_reg.iova_start = seg1->mr_dma;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 28c8570..6b5d243 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -177,6 +177,8 @@ struct rpcrdma_mw {
struct list_head mw_list;
 };
 
+#define RPCRDMA_BIT_FASTREG    (0)
+
 /*
  * struct rpcrdma_req -- structure central to the request/reply sequence.
  *



[PATCH v1 05/13] xprtrdma: Don't drain CQs on transport disconnect

2014-06-23 Thread Chuck Lever
CQs are not destroyed until unmount. By draining CQs on transport
disconnect, successful completions that can change the r.frmr.state
field can be missed.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 3c7f904..451e100 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -873,9 +873,6 @@ retry:
dprintk("RPC:   %s: rpcrdma_ep_disconnect"
" status %i\n", __func__, rc);
 
-   rpcrdma_clean_cq(ep->rep_attr.recv_cq);
-   rpcrdma_clean_cq(ep->rep_attr.send_cq);
-
xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
id = rpcrdma_create_id(xprt, ia,
(struct sockaddr *)&xprt->rx_data.addr);
@@ -985,8 +982,6 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
int rc;
 
-   rpcrdma_clean_cq(ep->rep_attr.recv_cq);
-   rpcrdma_clean_cq(ep->rep_attr.send_cq);
rc = rdma_disconnect(ia->ri_id);
if (!rc) {
/* returns without wait if not connected */



[PATCH v1 04/13] xprtrdma: Update rkeys after transport reconnect

2014-06-23 Thread Chuck Lever
Various reports of:

  rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep 8800bfd3e848

Ensure that rkeys in already-marshalled RPC/RDMA headers are
refreshed after the QP has been replaced by a reconnect.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249
Suggested-by: Selvin Xavier 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   77 ---
 net/sunrpc/xprtrdma/transport.c |   11 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |   10 +
 3 files changed, 56 insertions(+), 42 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 693966d..6eeb6d2 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -53,14 +53,6 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
-enum rpcrdma_chunktype {
-   rpcrdma_noch = 0,
-   rpcrdma_readch,
-   rpcrdma_areadch,
-   rpcrdma_writech,
-   rpcrdma_replych
-};
-
 #ifdef RPC_DEBUG
 static const char transfertypes[][12] = {
"pure inline",  /* no chunks */
@@ -286,6 +278,30 @@ out:
 }
 
 /*
+ * Marshal chunks. This routine returns the header length
+ * consumed by marshaling.
+ *
+ * Returns positive RPC/RDMA header size, or negative errno.
+ */
+
+ssize_t
+rpcrdma_marshal_chunks(struct rpc_rqst *rqst, ssize_t result)
+{
+   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
+   struct rpcrdma_msg *headerp = (struct rpcrdma_msg *)req->rl_base;
+
+   if (req->rl_rtype != rpcrdma_noch) {
+   result = rpcrdma_create_chunks(rqst,
+   &rqst->rq_snd_buf, headerp, req->rl_rtype);
+
+   } else if (req->rl_wtype != rpcrdma_noch) {
+   result = rpcrdma_create_chunks(rqst,
+   &rqst->rq_rcv_buf, headerp, req->rl_wtype);
+   }
+   return result;
+}
+
+/*
  * Copy write data inline.
  * This function is used for "small" requests. Data which is passed
  * to RPC via iovecs (or page list) is copied directly into the
@@ -377,7 +393,6 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
char *base;
size_t rpclen, padlen;
ssize_t hdrlen;
-   enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
 
/*
@@ -415,13 +430,13 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * into pages; otherwise use reply chunks.
 */
if (rqst->rq_rcv_buf.buflen <= RPCRDMA_INLINE_READ_THRESHOLD(rqst))
-   wtype = rpcrdma_noch;
+   req->rl_wtype = rpcrdma_noch;
else if (rqst->rq_rcv_buf.page_len == 0)
-   wtype = rpcrdma_replych;
+   req->rl_wtype = rpcrdma_replych;
else if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
-   wtype = rpcrdma_writech;
+   req->rl_wtype = rpcrdma_writech;
else
-   wtype = rpcrdma_replych;
+   req->rl_wtype = rpcrdma_replych;
 
/*
 * Chunks needed for arguments?
@@ -438,16 +453,16 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * TBD check NFSv4 setacl
 */
if (rqst->rq_snd_buf.len <= RPCRDMA_INLINE_WRITE_THRESHOLD(rqst))
-   rtype = rpcrdma_noch;
+   req->rl_rtype = rpcrdma_noch;
else if (rqst->rq_snd_buf.page_len == 0)
-   rtype = rpcrdma_areadch;
+   req->rl_rtype = rpcrdma_areadch;
else
-   rtype = rpcrdma_readch;
+   req->rl_rtype = rpcrdma_readch;
 
/* The following simplification is not true forever */
-   if (rtype != rpcrdma_noch && wtype == rpcrdma_replych)
-   wtype = rpcrdma_noch;
-   if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
+   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype == rpcrdma_replych)
+   req->rl_wtype = rpcrdma_noch;
+   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype != rpcrdma_noch) {
dprintk("RPC:   %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
@@ -461,7 +476,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * When padding is in use and applies to the transfer, insert
 * it and change the message type.
 */
-   if (rtype == rpcrdma_noch) {
+   if (req->rl_rtype == rpcrdma_noch) {
 
padlen = rpcrdma_inline_pullup(rqst,
RPCRDMA_INLINE_PAD_VALUE(rqst));
@@ -476,7 +491,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
headerp->rm_body.rm_padded.rm_pempty[1] = xdr_zero;
headerp->rm_body.rm_padded.rm_pempty[2] = xdr_zero;
hdrlen += 2 * sizeof(u32); /* extra words in padhdr */
-   if (wtype != rpcrdma_noch) {
+   if (req->rl_wtype != rpcrdma_noch) {
dprintk("RPC:   %s: invalid chunk list\n",
  

[PATCH v1 08/13] xprtrdma: Back off rkey when FAST_REG_MR fails

2014-06-23 Thread Chuck Lever
If posting a FAST_REG_MR Work Request fails, or the FAST_REG WR
flushes, revert the rkey update to avoid subsequent
IB_WC_MW_BIND_ERR completions.
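
The increment and its matching back-off can be sketched on a bare 32-bit value. This assumes, as the 0x00FF mask in the patch suggests, that only the low-order byte of the rkey changes while the upper 24 bits are preserved; update_key(), increment_key() and decrement_key() are hypothetical userspace helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the low 8 bits of a key, keeping the upper 24 bits. */
static uint32_t update_key(uint32_t rkey, uint8_t newkey)
{
	return (rkey & 0xffffff00u) | newkey;
}

/* Bump the low byte by one, wrapping from 0xff to 0x00. */
static uint32_t increment_key(uint32_t rkey)
{
	uint8_t key = rkey & 0xff;

	return update_key(rkey, (uint8_t)(key + 1));
}

/* Undo one bump, wrapping from 0x00 to 0xff. */
static uint32_t decrement_key(uint32_t rkey)
{
	uint8_t key = rkey & 0xff;

	return update_key(rkey, (uint8_t)(key - 1));
}

/* Usage example: a failed registration is undone by one decrement,
 * even when the increment wrapped the low byte. */
void rkey_example(void)
{
	uint32_t before = 0x123456ffu;
	uint32_t bumped = increment_key(before);

	assert(decrement_key(bumped) == before);
}
```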

Suggested-by: Steve Wise 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   39 +--
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index cef67fd..3efc007 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -61,6 +61,8 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static void rpcrdma_decrement_frmr_rkey(struct rpcrdma_mw *);
+
 /*
  * internal functions
  */
@@ -154,13 +156,17 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 
if (wrid == 0)
return;
-   if (wc->status != IB_WC_SUCCESS)
-   return;
 
fastreg = test_and_clear_bit(RPCRDMA_BIT_FASTREG, &wrid);
mw = (struct rpcrdma_mw *)wrid;
 
-   mw->r.frmr.fr_state = fastreg ? FRMR_IS_VALID : FRMR_IS_INVALID;
+   if (wc->status == IB_WC_SUCCESS) {
+   mw->r.frmr.fr_state = fastreg ?
+   FRMR_IS_VALID : FRMR_IS_INVALID;
+   } else {
+   if (fastreg)
+   rpcrdma_decrement_frmr_rkey(mw);
+   }
 }
 
 static int
@@ -1480,6 +1486,24 @@ rpcrdma_unmap_one(struct rpcrdma_ia *ia, struct rpcrdma_mr_seg *seg)
seg->mr_dma, seg->mr_dmalen, seg->mr_dir);
 }
 
+static void
+rpcrdma_increment_frmr_rkey(struct rpcrdma_mw *mw)
+{
+   struct ib_mr *frmr = mw->r.frmr.fr_mr;
+   u8 key = frmr->rkey & 0x00FF;
+
+   ib_update_fast_reg_key(frmr, ++key);
+}
+
+static void
+rpcrdma_decrement_frmr_rkey(struct rpcrdma_mw *mw)
+{
+   struct ib_mr *frmr = mw->r.frmr.fr_mr;
+   u8 key = frmr->rkey & 0x00FF;
+
+   ib_update_fast_reg_key(frmr, --key);
+}
+
 static int
 rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
int *nsegs, int writing, struct rpcrdma_ia *ia,
@@ -1487,8 +1511,6 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
 {
struct rpcrdma_mr_seg *seg1 = seg;
struct ib_send_wr invalidate_wr, frmr_wr, *bad_wr, *post_wr;
-
-   u8 key;
int len, pageoff;
int i, rc;
int seg_len;
@@ -1552,14 +1574,10 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
rc = -EIO;
goto out_err;
}
-
-   /* Bump the key */
-   key = (u8)(seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey & 0x00FF);
-   ib_update_fast_reg_key(seg1->mr_chunk.rl_mw->r.frmr.fr_mr, ++key);
-
frmr_wr.wr.fast_reg.access_flags = (writing ?
IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
IB_ACCESS_REMOTE_READ);
+   rpcrdma_increment_frmr_rkey(seg1->mr_chunk.rl_mw);
frmr_wr.wr.fast_reg.rkey = seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
@@ -1568,6 +1586,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
if (rc) {
dprintk("RPC:   %s: failed ib_post_send for register,"
" status %i\n", __func__, rc);
+   rpcrdma_decrement_frmr_rkey(seg1->mr_chunk.rl_mw);
goto out_err;
} else {
seg1->mr_rkey = seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;



[PATCH v1 06/13] xprtrdma: Unclutter struct rpcrdma_mr_seg

2014-06-23 Thread Chuck Lever
Clean ups:
 - make it obvious that the rl_mw field is a pointer -- allocated
   separately, not as part of struct rpcrdma_mr_seg
 - promote "struct {} frmr;" to a named type
 - promote the state enum to a named type
 - name the MW state field the same way other fields in
   rpcrdma_mw are named

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |6 +++--
 net/sunrpc/xprtrdma/xprt_rdma.h |   44 +--
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 451e100..e8ed81c 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -156,9 +156,9 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
return;
 
if (wc->opcode == IB_WC_FAST_REG_MR)
-   frmr->r.frmr.state = FRMR_IS_VALID;
+   frmr->r.frmr.fr_state = FRMR_IS_VALID;
else if (wc->opcode == IB_WC_LOCAL_INV)
-   frmr->r.frmr.state = FRMR_IS_INVALID;
+   frmr->r.frmr.fr_state = FRMR_IS_INVALID;
 }
 
 static int
@@ -1518,7 +1518,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
dprintk("RPC:   %s: Using frmr %p to map %d segments\n",
__func__, seg1->mr_chunk.rl_mw, i);
 
-   if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.state == FRMR_IS_VALID)) {
+   if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.fr_state == FRMR_IS_VALID)) {
dprintk("RPC:   %s: frmr %x left valid, posting invalidate.\n",
__func__,
seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c270e59..28c8570 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -146,6 +146,38 @@ struct rpcrdma_rep {
 };
 
 /*
+ * struct rpcrdma_mw - external memory region metadata
+ *
+ * An external memory region is any buffer or page that is registered
+ * on the fly (ie, not pre-registered).
+ *
+ * Each rpcrdma_buffer has a list of these anchored in rb_mws. During
+ * call_allocate, rpcrdma_buffer_get() assigns one to each segment in
+ * an rpcrdma_req. Then rpcrdma_register_external() grabs these to keep
+ * track of registration metadata while each RPC is pending.
+ * rpcrdma_deregister_external() uses this metadata to unmap and
+ * release these resources when an RPC is complete.
+ */
+enum rpcrdma_frmr_state {
+   FRMR_IS_INVALID,
+   FRMR_IS_VALID,
+};
+
+struct rpcrdma_frmr {
+   struct ib_fast_reg_page_list *fr_pgl;
+   struct ib_mr *fr_mr;
+   enum rpcrdma_frmr_state fr_state;
+};
+
+struct rpcrdma_mw {
+   union {
+   struct ib_fmr   *fmr;
+   struct rpcrdma_frmr frmr;
+   } r;
+   struct list_head mw_list;
+};
+
+/*
  * struct rpcrdma_req -- structure central to the request/reply sequence.
  *
  * N of these are associated with a transport instance, and stored in
@@ -172,17 +204,7 @@ struct rpcrdma_rep {
 struct rpcrdma_mr_seg {/* chunk descriptors */
union { /* chunk memory handles */
struct ib_mr*rl_mr; /* if registered directly */
-   struct rpcrdma_mw { /* if registered from region */
-   union {
-   struct ib_fmr   *fmr;
-   struct {
-   struct ib_fast_reg_page_list *fr_pgl;
-   struct ib_mr *fr_mr;
-   enum { FRMR_IS_INVALID, FRMR_IS_VALID } state;
-   } frmr;
-   } r;
-   struct list_head mw_list;
-   } *rl_mw;
+   struct rpcrdma_mw *rl_mw;   /* if registered from region */
} mr_chunk;
u64 mr_base;/* registration result */
u32 mr_rkey;/* registration result */



[PATCH v1 02/13] xprtrdma: Protect ->qp during FRMR deregistration

2014-06-23 Thread Chuck Lever
Ensure the QP remains valid while posting LOCAL_INV during a
transport reconnect. Otherwise, ia->ri_id->qp is NULL, which
triggers a panic.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=259
Fixes: ec62f40d3505a643497d105c297093bb90afd44e
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   14 +++---
 net/sunrpc/xprtrdma/xprt_rdma.h |1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 78bd7c6..f70b8ad 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -613,6 +613,7 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
/* Else will do memory reg/dereg for each chunk */
ia->ri_memreg_strategy = memreg;
 
+   rwlock_init(&ia->ri_qplock);
return 0;
 out2:
rdma_destroy_id(ia->ri_id);
@@ -859,7 +860,7 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 int
 rpcrdma_ep_connect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
 {
-   struct rdma_cm_id *id;
+   struct rdma_cm_id *id, *old;
int rc = 0;
int retry_count = 0;
 
@@ -905,9 +906,14 @@ retry:
rc = -ENETUNREACH;
goto out;
}
-   rdma_destroy_qp(ia->ri_id);
-   rdma_destroy_id(ia->ri_id);
+
+   write_lock(&ia->ri_qplock);
+   old = ia->ri_id;
ia->ri_id = id;
+   write_unlock(&ia->ri_qplock);
+
+   rdma_destroy_qp(old);
+   rdma_destroy_id(old);
} else {
dprintk("RPC:   %s: connecting...\n", __func__);
rc = rdma_create_qp(ia->ri_id, ia->ri_pd, &ep->rep_attr);
@@ -1597,7 +1603,9 @@ rpcrdma_deregister_frmr_external(struct rpcrdma_mr_seg *seg,
invalidate_wr.ex.invalidate_rkey = 
seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
+   read_lock(&ia->ri_qplock);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   read_unlock(&ia->ri_qplock);
if (rc)
dprintk("RPC:   %s: failed ib_post_send for invalidate,"
" status %i\n", __func__, rc);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 89e7cd4..97ca516 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -59,6 +59,7 @@
  * Interface Adapter -- one per transport instance
  */
 struct rpcrdma_ia {
+   rwlock_t ri_qplock;
struct rdma_cm_id   *ri_id;
struct ib_pd *ri_pd;
struct ib_mr *ri_bind_mem;



[PATCH v1 01/13] xprtrdma: Fix panic in rpcrdma_register_frmr_external()

2014-06-23 Thread Chuck Lever
seg1->mr_nsegs is not yet initialized when it is used to unmap
segments during an error exit. Use the same unmapping logic for
all error exits.

"if (frmr_wr.wr.fast_reg.length < len) {" used to be a BUG_ON check.
The broken code should never be executed under normal operation.
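
The single-exit unwinding pattern this fix adopts can be sketched in userspace. The sketch is deliberately simpler than the patched function: map_one(), unmap_one(), register_segments() and the fail_at parameter are hypothetical, and the failure is simulated rather than detected.

```c
#include <assert.h>
#include <errno.h>

#define NSEGS 4		/* hypothetical number of segments */

static int mapped[NSEGS];	/* 1 while segment i is mapped */

static void map_one(int i)   { mapped[i] = 1; }
static void unmap_one(int i) { mapped[i] = 0; }

/* Count the segments that are currently mapped. */
static int count_mapped(void)
{
	int i, n = 0;

	for (i = 0; i < NSEGS; i++)
		n += mapped[i];
	return n;
}

/* Map NSEGS segments. fail_at simulates an error detected once that
 * many segments are mapped; pass -1 for no failure. Every error path
 * jumps to one label, which unmaps exactly the i segments mapped so
 * far using the loop counter rather than any later-initialized field. */
static int register_segments(int fail_at)
{
	int i, rc;

	assert(fail_at < NSEGS);
	for (i = 0; i < NSEGS; i++) {
		if (i == fail_at) {
			rc = -EIO;
			goto out_err;
		}
		map_one(i);
	}
	return 0;

out_err:
	while (i--)
		unmap_one(i);
	return rc;
}
```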

Fixes: c977dea22708688eae31774f70126c97aa4dfe83
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 13dbd1c..78bd7c6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1545,9 +1545,8 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
frmr_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
frmr_wr.wr.fast_reg.length = page_no << PAGE_SHIFT;
if (frmr_wr.wr.fast_reg.length < len) {
-   while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
-   return -EIO;
+   rc = -EIO;
+   goto out_err;
}
 
/* Bump the key */
@@ -1565,8 +1564,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
if (rc) {
dprintk("RPC:   %s: failed ib_post_send for register,"
" status %i\n", __func__, rc);
-   while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   goto out_err;
} else {
seg1->mr_rkey = seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
@@ -1574,6 +1572,10 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg 
*seg,
seg1->mr_len = len;
}
*nsegs = i;
+   return 0;
+out_err:
+   while (i--)
+   rpcrdma_unmap_one(ia, --seg);
return rc;
 }
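
The unwinding that the new out_err label performs can be modeled outside
the kernel. What follows is only a minimal userspace sketch, not the
xprtrdma code: struct seg_sketch, map_one(), unmap_one() and
register_sketch() are hypothetical stand-ins, and the "fail" argument
stands in for either error exit (the length check or a failed
ib_post_send). It assumes, as in the patched function, that "i" counts the
segments mapped so far and "seg" points one past the last mapped segment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rpcrdma_mr_seg. */
struct seg_sketch {
	int mapped;
};

static void map_one(struct seg_sketch *s)   { s->mapped = 1; }
static void unmap_one(struct seg_sketch *s) { s->mapped = 0; }

/*
 * Map nsegs segments, then either succeed or take the shared error exit.
 * Every error path funnels through out_err, so the unwinding depends only
 * on "i" and "seg", never on a field that may not be initialized yet.
 */
static int register_sketch(struct seg_sketch *seg, int nsegs, int fail,
			   int *mapped_out)
{
	int i, rc;

	assert(seg != NULL && nsegs > 0);

	for (i = 0; i < nsegs; i++)
		map_one(seg++);
	/* seg now points one past the last mapped segment */

	if (fail) {
		rc = -EIO;
		goto out_err;
	}

	*mapped_out = i;
	return 0;

out_err:
	/* Walk backwards over exactly the segments mapped above. */
	while (i--)
		unmap_one(--seg);
	return rc;
}
```

Calling register_sketch() with fail set leaves every segment unmapped and
does not touch *mapped_out, which is the property the single error exit is
meant to guarantee.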
 



[PATCH v1 00/13] NFS/RDMA patches for 3.17

2014-06-23 Thread Chuck Lever
The main purpose of this series is to address more connection drop
recovery issues by fixing FRMR re-use to make it less likely the
client will drop the connection due to a memory operation error.

Some other clean-ups and fixes are present as well.

See topic branch nfs-rdma-for-3.17 in

  git://git.linux-nfs.org/projects/cel/cel-2.6.git

I tested with NFSv3 and NFSv4 on all three supported memory
registration modes. Used cthon04 and iozone with both Solaris
and Linux NFS/RDMA servers. Used xfstests with Linux.

---

Chuck Lever (13):
  xprtrdma: Fix panic in rpcrdma_register_frmr_external()
  xprtrdma: Protect ->qp during FRMR deregistration
  xprtrdma: Limit data payload size for ALLPHYSICAL
  xprtrdma: Update rkeys after transport reconnect
  xprtrdma: Don't drain CQs on transport disconnect
  xprtrdma: Unclutter struct rpcrdma_mr_seg
  xprtrdma: Encode Work Request opcode in wc->wr_id
  xprtrdma: Back off rkey when FAST_REG_MR fails
  xprtrdma: Refactor rpcrdma_buffer_put()
  xprtrdma: Release FRMR segment buffers during LOCAL_INV completion
  xprtrdma: Clean up rpcrdma_ep_disconnect()
  xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
  xprtrdma: Handle additional connection events


 include/linux/sunrpc/xprtrdma.h |2 
 net/sunrpc/xprtrdma/rpc_rdma.c  |   77 +
 net/sunrpc/xprtrdma/transport.c |   17 +-
 net/sunrpc/xprtrdma/verbs.c |  330 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |   63 ++-
 5 files changed, 332 insertions(+), 157 deletions(-)

--
Chuck Lever


Re: [PATCH 00/22] Add and use pci_zalloc_consistent

2014-06-23 Thread David Miller
From: Joe Perches 
Date: Mon, 23 Jun 2014 06:41:28 -0700

> Adding the helper reduces object code size as well as overall
> source size line count.
> 
> It's also consistent with all the various zalloc mechanisms
> in the kernel.
> 
> Done with a simple cocci script and some typing.

For networking bits:

Acked-by: David S. Miller 


Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper

2014-06-23 Thread Mike Christie
On 06/11/2014 04:09 AM, Sagi Grimberg wrote:
> In case protection information exists on the wire
> scsi transports should include it in the transfer
> byte count (even if protection information does not
> exist in the host memory space). This helper will
> compute the total transfer length from the scsi
> command data length and protection attributes.
> 
> Signed-off-by: Sagi Grimberg 
> Signed-off-by: Martin K. Petersen 
> ---
>  include/scsi/scsi_cmnd.h |   17 +
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
> index dd7c998..a100c6e 100644
> --- a/include/scsi/scsi_cmnd.h
> +++ b/include/scsi/scsi_cmnd.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct Scsi_Host;
>  struct scsi_device;
> @@ -306,4 +307,20 @@ static inline void set_driver_byte(struct scsi_cmnd 
> *cmd, char status)
>   cmd->result = (cmd->result & 0x00ff) | (status << 24);
>  }
>  
> +static inline unsigned scsi_transfer_length(struct scsi_cmnd *scmd)
> +{
> + unsigned int xfer_len = blk_rq_bytes(scmd->request);

Can you do bidi and dif/dix? If so, then instead of using blk_rq_bytes
directly should it use the scsi_out/scsi_in macros and access the length
through the scsi_data_buffer?

This does not fix Christoph's bug in the other mail. Just noticed it
while looking at the code.


> + unsigned int prot_op = scsi_get_prot_op(scmd);
> + unsigned int sector_size = scmd->device->sector_size;
> +
> + switch (prot_op) {
> + case SCSI_PROT_NORMAL:
> + case SCSI_PROT_WRITE_STRIP:
> + case SCSI_PROT_READ_INSERT:
> + return xfer_len;
> + }
> +
> + return xfer_len + (xfer_len >> ilog2(sector_size)) * 8;
> +}
> +
>  #endif /* _SCSI_SCSI_CMND_H */
> 
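
For reference, the arithmetic in the quoted helper can be tried in
userspace. This is a rough sketch under stated assumptions, not the kernel
helper: enum prot_op_sketch, ilog2_sketch() and transfer_length_sketch()
are hypothetical names, data_len stands in for whatever length the real
helper ends up reading (blk_rq_bytes() or a scsi_data_buffer length), and
PROT_OTHER_SKETCH represents any protection op other than the three the
quoted switch statement returns early for.

```c
#include <assert.h>

/* Hypothetical model of the protection ops the quoted switch handles. */
enum prot_op_sketch {
	PROT_NORMAL_SKETCH,
	PROT_WRITE_STRIP_SKETCH,
	PROT_READ_INSERT_SKETCH,
	PROT_OTHER_SKETCH,	/* any op not listed in the quoted switch */
};

/* Integer log2 of a power of two, standing in for the kernel's ilog2(). */
static unsigned int ilog2_sketch(unsigned int v)
{
	unsigned int shift = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return shift;
}

static unsigned int transfer_length_sketch(unsigned int data_len,
					   unsigned int sector_size,
					   enum prot_op_sketch op)
{
	switch (op) {
	case PROT_NORMAL_SKETCH:
	case PROT_WRITE_STRIP_SKETCH:
	case PROT_READ_INSERT_SKETCH:
		/* The quoted helper returns the data length unchanged. */
		return data_len;
	default:
		/* Otherwise add 8 bytes per sector, as in the quoted patch. */
		return data_len + (data_len >> ilog2_sketch(sector_size)) * 8;
	}
}
```

With 512-byte sectors, a 4096-byte transfer covers 8 sectors, so the
non-early-return case yields 4096 + 8 * 8 = 4160 bytes. The result is
only as good as the data_len passed in.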



Re: how to re-use a QP for a new connection

2014-06-23 Thread Steve Wise

On 6/23/2014 12:31 PM, Chuck Lever wrote:
> On Jun 23, 2014, at 1:25 PM, Hefty, Sean  wrote:
>
>>> For the record, with both mlx4 and cxgb4, we see FRMRs left valid
>>> after a FAST_REG_MR is flushed during a connection loss. More study
>>> needed, obviously.
>>
>> Is the bug that this type of WR completes in error, but actually exposed the
>> memory region?
>
> We haven’t checked if the MR is exposed; hadn’t thought of that!


I don't think this is a bug.  It is a race where HW is in the process of 
fast-registering the memory at the time the QP is moved out of RTS, 
causing all pending work requests to get FLUSHED.  I looked at both the 
IBTA IB and IETF iWARP Verbs specs, and neither states explicitly what 
FLUSHED status means.  They both say "at the time the QP was moved 
to ERROR the work request was not complete".  That doesn't indicate 
that the work request was canceled or didn't actually complete.  At 
least that's how I read it.  Regardless, the Chelsio hardware behaves 
this way.  And apparently the mlx hardware does too.


Anyway, for cxgb4 at least, the FRMR can be left in the valid state.  
The correct procedure, in the case of a fast-reg wr completing as 
FLUSHED is to dereg the MR if you want to ensure the region is invalidated.



> What we do know is that a subsequent LOCAL_INVALIDATE using the rkey that
> should work (if FAST_REG_MR had indeed never been done) fails in some cases.
> With mlx4, the LINV completes with IB_WC_MW_BIND_ERR. Steve can provide
> more detail about the exact failure mode with cxgb4.


cxgb4 completes with IB_WC_LOC_ACCESS_ERR.

Steve.


Re: [PATCH v2 2/3] libiscsi, iser: Adjust data_length to include protection information

2014-06-23 Thread Christoph Hellwig
This patch causes a regression when using the iscsi initiator over
TCP for me. When mounting a newly created ext4 filesystem I get the
following BUG: 

[   31.611803] BUG: unable to handle kernel NULL pointer dereference at 
000c
[   31.613563] IP: [] iscsi_tcp_segment_done+0x2bd/0x380
[   31.613563] PGD 7a3e4067 PUD 7a45f067 PMD 0 
[   31.613563] Oops:  [#1] SMP 
[   31.613563] Modules linked in:
[   31.613563] CPU: 3 PID: 3739 Comm: kworker/u8:5 Not tainted 3.16.0-rc2 #187
[   31.613563] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   31.613563] Workqueue: iscsi_q_2 iscsi_xmitworker
[   31.613563] task: 88007b33cf10 ti: 88007ad94000 task.ti: 
88007ad94000
[   31.613563] RIP: 0010:[]  [] 
iscsi_tcp_segment_done+0x2bd/0x380
[   31.613563] RSP: 0018:88007ad97b38  EFLAGS: 00010246
[   31.613563] RAX:  RBX: 88007cd67910 RCX: 0200
[   31.613563] RDX: 2000 RSI:  RDI: 88007cd67910
[   31.613563] RBP: 88007ad97b98 R08: 0200 R09: 
[   31.613563] R10:  R11: 0001 R12: 
[   31.613563] R13: 88007cd67780 R14:  R15: 
[   31.613563] FS:  () GS:88007fd8() 
knlGS:
[   31.613563] CS:  0010 DS:  ES:  CR0: 8005003b
[   31.613563] CR2: 000c CR3: 7afd9000 CR4: 06e0
[   31.613563] Stack:
[   31.613563]  88007ad97b98 81c68fd6 81c68f20 
88007c8e37c8
[   31.613563]  7b33d728 88007dc805b0 88007ad97c58 
0200
[   31.613563]  88007cd67780 88c00040 88007ad97c00 
88007cd67910
[   31.613563] Call Trace:
[   31.613563]  [] ? inet_sendpage+0xb6/0x130
[   31.613563]  [] ? inet_dgram_connect+0x80/0x80
[   31.613563]  [] iscsi_sw_tcp_pdu_xmit+0xe5/0x2e0
[   31.613563]  [] ? iscsi_sw_tcp_pdu_init+0x1bf/0x390
[   31.613563]  [] iscsi_tcp_task_xmit+0xa2/0x2b0
[   31.613563]  [] ? iscsi_xmit_task+0x45/0xd0
[   31.613563]  [] ? trace_hardirqs_on+0xd/0x10
[   31.613563]  [] ? __local_bh_enable_ip+0x70/0xd0
[   31.613563]  [] iscsi_xmit_task+0x59/0xd0
[   31.613563]  [] iscsi_xmitworker+0x288/0x330
[   31.613563]  [] process_one_work+0x1c7/0x490
[   31.613563]  [] ? process_one_work+0x15d/0x490
[   31.613563]  [] worker_thread+0x119/0x4f0
[   31.613563]  [] ? trace_hardirqs_on+0xd/0x10
[   31.613563]  [] ? init_pwq+0x190/0x190
[   31.613563]  [] kthread+0xdf/0x100
[   31.613563]  [] ? __init_kthread_worker+0x70/0x70
[   31.613563]  [] ret_from_fork+0x7c/0xb0
[   31.613563]  [] ? __init_kthread_worker+0x70/0x70
[   31.613563] Code: 89 03 31 c0 e9 cc fe ff ff 0f 1f 44 00 00 48 8b 7b
30 e8 17 74 de ff 8b 53 10 c7 43 40 00 00 00 00 48 89 43 30 44 89 f6 48
89 df <8b> 40 0c 48 c7 03 00 00 00 00 2b 53 14 39 c2 0f 47 d0 89 53 08 


(gdb) l *(iscsi_tcp_segment_done+0x2bd)
0x8197b38d is in iscsi_tcp_segment_done
(../drivers/scsi/libiscsi_tcp.c:102).
97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
98struct scatterlist *sg, unsigned int offset)
99  {
100 segment->sg = sg;
101 segment->sg_offset = offset;
102 segment->size = min(sg->length - offset,
103 segment->total_size - 
segment->total_copied);
104 segment->data = NULL;
105 }
106 



Re: PCI/AER: AER in SRIOV environment

2014-06-23 Thread Don Dutile

On 06/23/2014 03:09 PM, Bjorn Helgaas wrote:
> [+cc linux-pci, Don]

Adding Alex Williamson in case he can add more to this conversation...

> On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
>  wrote:
>> Hi Vijay,
>> Trying to add AER support for Mellanox NIC in SRIOV environment, while
>> evaluating/testing encountered a problem which led me to your
>> patch accepted as part of kernel 3.8, commit ID
>> "918b4053184c0ca22236e70e299c5343eea35304".
>>
>> Have some concerns/questions on:
>> When working in SRIOV environment VFs may be un-attached, having no driver
>> assigned to, or may be attached to Virtual machine to work in some
>> pass-through mode.
>> Once working in KVM setup there is pci-stub driver which is loaded in the
>> HYP/PF for a given attached VF.

huh? 'loaded in the hyp/pf'?  um, loaded in the host, and a VF is
detached from its host driver -- a VF can be used in the host w/o any
virtualization, i.e., that's how guest VM is driving the VF: as if it was
used by a guest (host) OS directly -- and attached to pci-stub driver,
when assigned to a KVM guest in pre-VFIO days/ways.
If VFIO used, then VF is attached to vfio-pci driver.

>> I'm using the aer-inject kernel module and its corresponding aer-inject tool
>> to simulate an error in the HYP.
>> In both cases your commit will cause the AER recovery to fail as there is no
>> driver assigned to PF's VFs that supports AER, comparing the code before
>> your change.

Without VFIO, I believe that's correct. There was no AER-to-VF support
pre-VFIO days.
I believe with the recent VFIO support,
and modifications to KVM, an AER that is associated with an assigned VF will
force the crash/halt of the KVM guest -- can't depend on a guest VF driver
clearing the AER in the hyp/host -- guest isn't privileged enough to clear
the error.
So, crashing the guest is the simple option at the moment, to contain the error.
Alex: do I have that (vfio aer default) correct, or is that still
site-under-construction?

>> How such cases should work ?  my expectation was that the PF will get the
>> error detected message then will recognize whether
>> issue is its own or one of its VFs

The AER packet will have the tag of the VF in if it was the source of the error;
so the PF will never see it; although one could argue it should be 'promoted'
to the PF if PF/VF needs to clear some state it has wrt the VF (the SRIOV spec
is lacking of info in this space); _but_, VFIO resets the VF (sets FLR bit)
when the device is deassigned and before re-attachment to the host, so that
should clear out any state btwn PF & VF ('should' ... famous last words...).

> I'm really not an AER expert, so help me understand this question of
> recognizing whether an error is associated with a PF or a VF.
>
> In terms of hardware, it looks like the device that detects an error
> logs some information and sends an Error Message upstream.  The Root
> Complex receives the message, captures the source ID from the Error
> Message, and may generate an interrupt.  I expect this source ID can
> be either a PF or a VF; there's no requirement that a VF error must be
> reported as though it's from the PF, is there?
>
>> and work accordingly, in current code
>> looks like recovery failed as part of "voting" once there is no AER handler
>> assigned to the VFs.
>
> The commit you mentioned has to do with PCI_ERS_RESULT_NO_AER_DRIVER.
> We use pci_walk_bus() to figure out whether all the devices in a
> subtree have a driver.  What subtree is involved here?  I would expect
> the VFs to be siblings of the PF, not children of it, so I'm not sure
> where things went wrong.

Well, VFs could be on virtual busses (ARI turned on), so not necessarily a
sibling to PF ... and then we have the problem in PCI code of not being able
to traverse these virtual busses (in some cases; not sure if pci_walk_bus(),
which is going down the tree vs up the tree, has any problems here w/VFs on
virtual busses).

> Can you collect "lspci -vvv" output and maybe add some debug so we can
> see exactly where the error is detected and what devices we're looking
> at to conclude that one of them doesn't have a driver?
>
> Bjorn





ibacm release 1.0.9

2014-06-23 Thread Hefty, Sean
A new release of ibacm - 1.0.9 - is available from:

https://www.openfabrics.org/downloads/rdmacm/ibacm-1.0.9.tar.gz

1.0.9 adds handling for dynamic IP address changes, plus a few bug fixes.

To avoid incorporating the recent provider support changes, the 1.0.9 release 
was created in a branch. 

- Sean


Re: [PATCH 00/22] Add and use pci_zalloc_consistent

2014-06-23 Thread Joe Perches
On Mon, 2014-06-23 at 10:25 -0700, Luis R. Rodriguez wrote:
> On Mon, Jun 23, 2014 at 06:41:28AM -0700, Joe Perches wrote:
> > Adding the helper reduces object code size as well as overall
> > source size line count.
> > 
> > It's also consistent with all the various zalloc mechanisms
> > in the kernel.
> > 
> > Done with a simple cocci script and some typing.
> 
> Awesome, any chance you can paste in the SmPL? Also any chance
> we can get this added to a make coccicheck so that maintainers
> moving forward can use that to ensure that no new code is
> added that uses the old school API?

Not many of these are recent.

Arnd Bergmann reasonably suggested that the pci_alloc_consistent
API be converted to the more widely used dma_alloc_coherent.

https://lkml.org/lkml/2014/6/23/513

> Shouldn't these drivers just use the normal dma-mapping API now?

and I replied:

https://lkml.org/lkml/2014/6/23/525

> Maybe.  I wouldn't mind.
> They do seem to have a trivial bit of unnecessary overhead for 
> hwdev == NULL ? NULL : &hwdev->dev

Anyway, here's the little script.
I'm not sure it's worthwhile to add it though.

$ cat ./scripts/coccinelle/api/alloc/pci_zalloc_consistent.cocci
///
/// Use pci_zalloc_consistent rather than
/// pci_alloc_consistent followed by memset with 0
///
/// This considers some simple cases that are common and easy to validate
/// Note in particular that there are no ...s in the rule, so all of the
/// matched code has to be contiguous
///
/// Blatantly cribbed from: scripts/coccinelle/api/alloc/kzalloc-simple.cocci

@@
type T, T2;
expression x;
expression E1,E2,E3;
statement S;
@@

- x = (T)pci_alloc_consistent(E1,E2,E3);
+ x = pci_zalloc_consistent(E1,E2,E3);
  if ((x==NULL) || ...) S
- memset((T2)x,0,E2);




Re: PCI/AER: AER in SRIOV environment

2014-06-23 Thread Bjorn Helgaas
[+cc linux-pci, Don]

On Mon, Jun 23, 2014 at 8:29 AM, Yishai Hadas
 wrote:
> Hi Vijay,
> Trying to add AER support for Mellanox NIC in SRIOV environment, while
> evaluating/testing encountered a problem which led me to your
> patch accepted as part of kernel 3.8, commit ID
> "918b4053184c0ca22236e70e299c5343eea35304".
>
> Have some concerns/questions on:
> When working in SRIOV environment VFs may be un-attached, having no driver
> assigned to, or may be attached to Virtual machine to work in some
> pass-through mode.
> Once working in KVM setup there is pci-stub driver which is loaded in the
> HYP/PF for a given attached VF.
>
> I'm using the aer-inject kernel module and its corresponding aer-inject tool
> to simulate an error in the HYP.
> In both cases your commit will cause the AER recovery to fail as there is no
> driver assigned to PF's VFs that supports AER, comparing the code before
> your change.
>
> How such cases should work ?  my expectation was that the PF will get the
> error detected message then will recognize whether
> issue is its own or one of its VFs

I'm really not an AER expert, so help me understand this question of
recognizing whether an error is associated with a PF or a VF.

In terms of hardware, it looks like the device that detects an error
logs some information and sends an Error Message upstream.  The Root
Complex receives the message, captures the source ID from the Error
Message, and may generate an interrupt.  I expect this source ID can
be either a PF or a VF; there's no requirement that a VF error must be
reported as though it's from the PF, is there?

> and work accordingly, in current code
> looks like recovery failed as part of "voting" once there is no AER handler
> assigned to the VFs.

The commit you mentioned has to do with PCI_ERS_RESULT_NO_AER_DRIVER.
We use pci_walk_bus() to figure out whether all the devices in a
subtree have a driver.  What subtree is involved here?  I would expect
the VFs to be siblings of the PF, not children of it, so I'm not sure
where things went wrong.

Can you collect "lspci -vvv" output and maybe add some debug so we can
see exactly where the error is detected and what devices we're looking
at to conclude that one of them doesn't have a driver?

Bjorn


Re: why flipping responder_resources/initiator_depth?

2014-06-23 Thread Jason Gunthorpe
On Mon, Jun 23, 2014 at 06:00:57PM +, Hefty, Sean wrote:
> > > The swapping and general missing handling of RR negotiating in the
> > > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > > longstanding bug, and I have written user space code that fixes it up
> > > in the past :(
> > 
> > Jason, the swapping takes place in the IB CM indeed, I just used the
> > wording from the librdmacm man pages to describe the desired
> > behaviour as I see it. Did you ever report the swapping on this
> > list in the past? when?
> 
> The behavior matches the documentation.  And the problem is...?

The problem is this whole thing is a giant gotcha if you don't
intimately understand exactly what the spec requires, and naively
assume the kernel does something sane, or even provides you the values
the spec says you need in fields that are named the same as the spec.

If you use the IB CM in userspace you need to hook IB_CM_REQ_RECEIVED
and do something like this:

/* Note, req.responder_resources and req.initiator_depth are swapped
   in the kernel. FIXME: this works around the kernel not implementing
   the negotiation procedure by doing it here */
rep.responder_resources = min((int)req.responder_resources,
   devAttr.max_qp_rd_atom);
rep.initiator_depth = min((int)req.initiator_depth,
   devAttr.max_qp_init_rd_atom);

So
 1) The kernel swapped the values before passing them to userspace
    (and other kernel consumers). So this becomes very confusing if
    you are not aware that req.responder_resources is not actually
    what the IBA spec describes as REQ responderResources.
 2) The kernel doesn't do anything to help implement the IBA spec
    required negotiation; it doesn't limit to HCA values, for instance,
    after getting a REQ.
 3) There is no aid to help a simple app developer do this right, and
    almost everyone I've ever looked at just passes 2 in for both
    values and hopes for the best.
 4) Other elements of the negotiation procedure I outlined above seem
    to be missing, like the sanity check of the REP, and the
    generation of REJ if the values are not acceptable.

I haven't looked at how this all plays through with RDMA CM. But
looking quickly, I don't see an obvious similar min in cma_connect_ib.

To my mind, the biggest issue is the common code does not seem to make
it easy for apps to correctly implement the IBA negotiation protocol.

Jason


RE: why flipping responder_resources/initiator_depth?

2014-06-23 Thread Hefty, Sean
> > The swapping and general missing handling of RR negotiating in the
> > whole kernel CM API (not just RDMA CM, but IB CM too) is a
> > longstanding bug, and I have written user space code that fixes it up
> > in the past :(
> 
> Jason, the swapping takes place in the IB CM indeed, I just used the
> wording from the librdmacm man pages to describe the desired
> behaviour as I see it. Did you ever report the swapping on this
> list in the past? when?

The behavior matches the documentation.  And the problem is...?

The initiator_depth and responder_resources must be swapped between the REQ and 
REP.  Why is having the RDMA CM do this swapping an issue?


Re: why flipping responder_resources/initiator_depth?

2014-06-23 Thread Or Gerlitz
On Mon, Jun 23, 2014 at 7:49 PM, Jason Gunthorpe
 wrote:
[...]
> The swapping and general missing handling of RR negotiating in the
> whole kernel CM API (not just RDMA CM, but IB CM too) is a
> longstanding bug, and I have written user space code that fixes it up
> in the past :(

Jason, the swapping takes place in the IB CM indeed, I just used the
wording from the librdmacm man pages to describe the desired
behaviour as I see it. Did you ever report the swapping on this
list in the past? when?


Re: how to re-use a QP for a new connection

2014-06-23 Thread Chuck Lever

On Jun 23, 2014, at 1:25 PM, Hefty, Sean  wrote:

>> For the record, with both mlx4 and cxgb4, we see FRMRs left valid
>> after a FAST_REG_MR is flushed during a connection loss. More study
>> needed, obviously.
> 
> Is the bug that this type of WR completes in error, but actually exposed the 
> memory region?

We haven’t checked if the MR is exposed; hadn’t thought of that!

What we do know is that a subsequent LOCAL_INVALIDATE using the rkey that
should work (if FAST_REG_MR had indeed never been done) fails in some cases.
With mlx4, the LINV completes with IB_WC_MW_BIND_ERR. Steve can provide
more detail about the exact failure mode with cxgb4.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





RE: how to re-use a QP for a new connection

2014-06-23 Thread Hefty, Sean
> For the record, with both mlx4 and cxgb4, we see FRMRs left valid
> after a FAST_REG_MR is flushed during a connection loss. More study
> needed, obviously.

Is the bug that this type of WR completes in error, but actually exposed the 
memory region?


Re: [PATCH 00/22] Add and use pci_zalloc_consistent

2014-06-23 Thread Luis R. Rodriguez
On Mon, Jun 23, 2014 at 06:41:28AM -0700, Joe Perches wrote:
> Adding the helper reduces object code size as well as overall
> source size line count.
> 
> It's also consistent with all the various zalloc mechanisms
> in the kernel.
> 
> Done with a simple cocci script and some typing.

Awesome, any chance you can paste in the SmPL? Also any chance
we can get this added to a make coccicheck so that maintainers
moving forward can use that to ensure that no new code is
added that uses the old school API?

  Luis


Re: how to re-use a QP for a new connection

2014-06-23 Thread Chuck Lever

On Jun 23, 2014, at 12:22 PM, Hefty, Sean  wrote:

>> Steve Wise is helping me with a particular issue where QP re-use might
>> be helpful.
>> 
>> When an RPC/RDMA transport connection is dropped (for example, the NFS
>> server crashes), xprtrdma destroys the transport's QP and creates a
>> new one for the next connection.
> 
> If the remote side crashes, the local QP can transition into the error state, 
> which would flush all posted receives.  I believe that a WR that has 
> completed in error only has the wr_id field valid.
> 
> Note that calling rdma_disconnect() will also transition the QP into the 
> error state.

So on remote disconnect there are two steps:

1. The QP is transitioned to the error state

2. Later, when xprtrdma attempts to reconnect, its transport connect
   worker destroys the old QP

I think you and Devesh are suggesting that when the QP is transitioned
to error state in step 1, the provider immediately flushes the send and
completion queues appropriately, leaving no possibility of a completed
WR with a dropped completion.

> 
>> We're not quite sure what IB_WC_WR_FLUSH_ERR means in that instance. Our
>> theory is there is a gap when the old QP is destroyed:
>> 
>> 1. If the HW reports a successful WR completion but the QP no longer
>>   exists, the provider substitutes an IB_WC_WR_FLUSH_ERR completion
>> 
>> 2. If the WR is dropped before the HW even saw it, the provider inserts
>>   an IB_WC_WR_FLUSH_ERR completion
>> 
>> So if xprtrdma is trying to submit a FAST_REG_MR WR and the completion
>> gets flushed, xprtrdma has no way to know whether the rkey was bumped in
>> the adapter. Thus it has no certainty which rkey to use to invalidate
>> that FRMR.
> 
> I'm not familiar with the behavior of fast reg mr.

For the record, with both mlx4 and cxgb4, we see FRMRs left valid
after a FAST_REG_MR is flushed during a connection loss. More study
needed, obviously.

>> I was idly wondering whether re-using the QP during connection loss
>> would provide a guarantee that xprtrdma would never see case 1 above.
>> Then IB_WC_WR_FLUSH_ERR on a FAST_REG_MR WR would be a more certain
>> indication that the HW still has the old rkey.
>> 
>> I suppose that xprtrdma can "hang onto" the QP without re-using it by
>> simply not destroying it until all WRs scheduled on the old QP are
>> completed. Is reference counting the QP the usual design pattern to deal
>> with this case?
> 
> I _thought_ that destroying the QP would cleanup any completion entries in 
> the CQ, but I'm not sure of this.  Referencing counting should work though. 

As a workaround, I can comment out the rdma_destroy_qp() call in xprtrdma's
connect worker to see if there’s any change in behavior when the old QP
stays around.

Given that the queues are flushed on RTS->Error, probably won’t see any
difference at all.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: why flipping responder_resources/initiator_depth?

2014-06-23 Thread Jason Gunthorpe
On Mon, Jun 23, 2014 at 08:55:07AM +0300, Or Gerlitz wrote:
> 1. the client to put into the responder_resources they provide to
> rdma_connect the maximum number of outstanding RDMA reads that they
> will be able to accept from the server side
> 
> 2. the server to apply a minimum function between the
> responder_resources which were advertized by the client (and they get
> in the connection request event params) to how many inflight
> rdma-reads  their HCA supports

From a wire perspective the spec is pretty clear what the CM responder
resources and initiator depth are supposed to be, and the behavior of
#2 is mandated in the spec.

From an API perspective it makes sense that the only input to the
API would be 'the initiator depth the caller will use', which is
basically the only thing the caller actually controls. 0 if the client
never uses RDMA READ or ATOMICs, 1 if it is strictly interlocked, and
higher as necessary.

I'm not sure there is a use case to limit QP responder resources at
the caller? Maybe to specify '0' if the caller knows it will never
setup a remote readable MR?

So both sides pass in their desired initiator depth. Both sides limit
that to HCA init depth capabilities. The REQ side plugs that value
into REQ.initiatorDepth and the HCA capability into
REQ.responderResources.

The REQ responder takes min(REQ.responderResources, local
initiatorDepth) and returns that in REP.initiatorDepth. It takes
min(REQ.initiatorDepth, HW respres capability) and plugs that into the
local QP and returns it in REP.responderResources.

The REQ initiator takes that reply and does
min(REP.responderResources, HW initdepth capability, API depth) and
plugs that into the QP, checks that REP.initiatorDepth <
REQ.responderResources and errors if false, and plugs REP.initiatorDepth
into the local QP's responder resources.
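
The exchange described above can be written down as a small model. This is
only a sketch under assumptions, not kernel or librdmacm code: every struct
and function name below is hypothetical, the REP sanity check is written as
"<=" on the assumption that an equal value is acceptable, and the responder
is assumed to program REP.initiatorDepth into its own QP as its initiator
depth.

```c
#include <assert.h>

/* Hypothetical containers; none of these are kernel or verbs types. */
struct hca_caps_sketch {
	unsigned int max_init_depth;	/* HW initiator depth capability */
	unsigned int max_resp_res;	/* HW responder resources capability */
};

struct cm_msg_sketch {			/* the fields carried in REQ or REP */
	unsigned int initiator_depth;
	unsigned int responder_resources;
};

struct qp_rr_sketch {			/* values programmed into the local QP */
	unsigned int initiator_depth;
	unsigned int responder_resources;
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* REQ side: desired depth limited by the HCA, plus the HCA's responder cap. */
static struct cm_msg_sketch build_req(const struct hca_caps_sketch *hca,
				      unsigned int api_depth)
{
	struct cm_msg_sketch req;

	assert(hca != 0);
	req.initiator_depth = min_u(api_depth, hca->max_init_depth);
	req.responder_resources = hca->max_resp_res;
	return req;
}

/* REP side: apply the min functions and program the local QP. */
static struct cm_msg_sketch build_rep(const struct hca_caps_sketch *hca,
				      unsigned int api_depth,
				      const struct cm_msg_sketch *req,
				      struct qp_rr_sketch *qp)
{
	struct cm_msg_sketch rep;
	unsigned int local_depth = min_u(api_depth, hca->max_init_depth);

	rep.initiator_depth = min_u(req->responder_resources, local_depth);
	rep.responder_resources = min_u(req->initiator_depth,
					hca->max_resp_res);
	qp->initiator_depth = rep.initiator_depth;	/* assumption, see above */
	qp->responder_resources = rep.responder_resources;
	return rep;
}

/* REQ side on receiving REP: returns 0, or -1 if a REJ would be needed. */
static int handle_rep(const struct hca_caps_sketch *hca,
		      unsigned int api_depth,
		      const struct cm_msg_sketch *req,
		      const struct cm_msg_sketch *rep,
		      struct qp_rr_sketch *qp)
{
	if (rep->initiator_depth > req->responder_resources)
		return -1;
	qp->initiator_depth = min_u(rep->responder_resources,
				    min_u(hca->max_init_depth, api_depth));
	qp->responder_resources = rep->initiator_depth;
	return 0;
}
```

For example, a client asking for depth 4 against a server that never issues
RDMA READs (depth 0) ends up with initiator depth 4 and responder resources
0 on the client QP, and the mirror image on the server QP.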

The swapping and general missing handling of RR negotiating in the
whole kernel CM API (not just RDMA CM, but IB CM too) is a
longstanding bug, and I have written user space code that fixes it up
in the past :(

It works OK if both sides hard code 2 or 4, or whatever is 99% of use
cases, it is broken if you are doing what Or is talking about, and
optimizing RR usage because one half of a connection doesn't use RRs at
all.

Jason


RE: [PATCH v1 for-next 00/16] Bug fixes for ocrdma driver

2014-06-23 Thread Devesh Sharma
Roland,

Please consider this series to be merged to the next pull request to Linus. 
This contains some critical bug fixes for ocrdma, and we don't want to miss
this pull cycle.

-Best Regards
 Devesh
> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Selvin Xavier
> Sent: Tuesday, June 10, 2014 7:32 PM
> To: linux-rdma@vger.kernel.org
> Cc: rol...@kernel.org; Selvin Xavier
> Subject: [PATCH v1 for-next 00/16] Bug fixes for ocrdma driver
> 
> This patch series contains few bug fixes for ocrdma driver
> 
> Changes in v1:
>  Proper change logs are added as per Or's suggestion  Updating the ocrdma
> driver version string
> 
> Please apply to for-next tree
> 
> Devesh Sharma (5):
>   RDMA/ocrdma: Avoid posting DPP requests for RDMA READ
>   be2net: issue shutdown event to ocrdma driver
>   RDMA/ocrdma: Handle shutdown event from be2net driver
>   RDMA/ocrdma: Remove hardcoding of the max DPP QPs supported
>   RDMA/ocrdma: Delete AH table if ocrdma_init_hw fails after AH table
> creation
> 
> Mitesh Ahuja (3):
>   RDMA/ocrdma: Allow only SEND opcode in case of UD QPs
>   RDMA/ocrdma: Do proper cleanup evenif FW is in error state
>   RDMA/ocrdma: Return proper value for max_mr_size
> 
> Selvin Xavier (8):
>   RDMA/ocrdma: Query and initalize the PFC SL
>   RDMA/ocrdma: Adding hca_type and fixing fw_version string in device
> atrributes
>   RDMA/ocrdma: Avoid reporting wrong completions in case of error CQEs
>   RDMA/ocrdma : Add missing adapter mailbox opcodes
>   RDMA/ocrdma: Increase the size of STAG array in dev structure to 16K
>   RDMA/ocrdma: Initialize the GID table while registering the device
>   RDMA/ocrdma: Fixing a sparse warning
>   RDMA/ocrdma: Update the ocrdma module version string
> 
>  drivers/infiniband/hw/ocrdma/ocrdma.h   |   26 -
>  drivers/infiniband/hw/ocrdma/ocrdma_ah.c|2 +
>  drivers/infiniband/hw/ocrdma/ocrdma_hw.c|  197 ++-
>  drivers/infiniband/hw/ocrdma/ocrdma_hw.h|2 +
>  drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   83 +++-
>  drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |  148 -
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |   34 -
>  drivers/net/ethernet/emulex/benet/be.h  |1 +
>  drivers/net/ethernet/emulex/benet/be_main.c |1 +
>  drivers/net/ethernet/emulex/benet/be_roce.c |   18 ++-
>  drivers/net/ethernet/emulex/benet/be_roce.h |3 +-
>  11 files changed, 466 insertions(+), 49 deletions(-)
> 


RE: how to re-use a QP for a new connection

2014-06-23 Thread Hefty, Sean
> Steve Wise is helping me with a particular issue where QP re-use might
> be helpful.
> 
> When an RPC/RDMA transport connection is dropped (for example, the NFS
> server crashes), xprtrdma destroys the transport's QP and creates a
> new one for the next connection.

If the remote side crashes, the local QP can transition into the error state, 
which would flush all posted receives.  I believe that a WR that has completed 
in error only has the wr_id field valid.

Note that calling rdma_disconnect() will also transition the QP into the error 
state.
 
> We're not quite sure what IB_WC_WR_FLUSH_ERR means in that instance. Our
> theory is there is a gap when the old QP is destroyed:
> 
> 1. If the HW reports a successful WR completion but the QP no longer
>exists, the provider substitutes an IB_WC_WR_FLUSH_ERR completion
> 
> 2. If the WR is dropped before the HW even saw it, the provider inserts
>an IB_WC_WR_FLUSH_ERR completion
> 
> So if xprtrdma is trying to submit a FAST_REG_MR WR and the completion
> gets flushed, xprtrdma has no way to know whether the rkey was bumped in
> the adapter. Thus it has no certainty which rkey to use to invalidate
> that FRMR.

I'm not familiar with the behavior of fast reg mr.
 
> I was idly wondering whether re-using the QP during connection loss
> would provide a guarantee that xprtrdma would never see case 1 above.
> Then IB_WC_WR_FLUSH_ERR on a FAST_REG_MR WR would be a more certain
> indication that the HW still has the old rkey.
> 
> I suppose that xprtrdma can "hang onto" the QP without re-using it by
> simply not destroying it until all WRs scheduled on the old QP are
> completed. Is reference counting the QP the usual design pattern to deal
> with this case?

I _thought_ that destroying the QP would clean up any completion entries in the 
CQ, but I'm not sure of this.  Reference counting should work though. 
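
A minimal sketch of the reference-counting idea, in plain user-space C.
Everything here is a stand-in invented for the example: held_qp, fake_wc
and the helper names are hypothetical, and the enum only mimics the
IB_WC_* status codes. The sketch is single-threaded; kernel code would
need an atomic counter (for instance a kref) and would call the real
destroy verb where the flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the real completion status codes. */
enum fake_wc_status { FAKE_WC_SUCCESS, FAKE_WC_WR_FLUSH_ERR };

/* Only wr_id and status are consulted, since a WR completed in error
 * may have nothing else valid. */
struct fake_wc {
    uint64_t wr_id;
    enum fake_wc_status status;
};

struct held_qp {
    int refs;       /* one for the owner, plus one per outstanding WR */
    bool destroyed; /* set when the last reference goes away */
};

static void held_qp_init(struct held_qp *qp)
{
    qp->refs = 1;
    qp->destroyed = false;
}

static void held_qp_get(struct held_qp *qp)
{
    assert(qp->refs > 0);
    qp->refs++;
}

static void held_qp_put(struct held_qp *qp)
{
    assert(qp->refs > 0);
    if (--qp->refs == 0)
        qp->destroyed = true; /* real code would destroy the QP here */
}

/* Posting a WR pins the QP until that WR's completion is reaped. */
static void post_wr(struct held_qp *qp)
{
    held_qp_get(qp);
}

/* Successful and flushed completions both release the WR's reference. */
static void on_completion(struct held_qp *qp, const struct fake_wc *wc)
{
    (void)wc;
    held_qp_put(qp);
}
```

The owner drops its own reference at disconnect time, and the QP is only
torn down once the last flushed completion has been seen.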


RE: [PATCH for-next 0/2] Bug fixes for nfs-rdma support

2014-06-23 Thread Devesh Sharma
Roland,

Please pull this series into 3.16 to make sure ocrdma works fine with upstream 
NFS-RDMA.
-Regards
 Devesh

> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of dev...@vger.kernel.org
> Sent: Monday, June 09, 2014 10:53 AM
> To: linux-rdma@vger.kernel.org
> Cc: Devesh Sharma
> Subject: [PATCH for-next 0/2] Bug fixes for nfs-rdma support
> 
> From: Devesh Sharma 
> 
> This series fixes the improper frmr-page-list-len size reported by
> query_device and a bug in CQ arming logic.
> 
> Devesh Sharma (2):
>   RDMA/ocrdma: Report actual value of max_fast_reg_page_list_len
>   RDMA/ocrdma: do not skip setting deffered_arm
> 
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |6 ++
>  1 files changed, 2 insertions(+), 4 deletions(-)
> 


RE: how to re-use a QP for a new connection

2014-06-23 Thread Devesh Sharma
Hi Chuck,


> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Chuck Lever
> Sent: Monday, June 23, 2014 8:51 PM
> To: Hefty, Sean
> Cc: linux-rdma
> Subject: Re: how to re-use a QP for a new connection
> 
> Hi Sean-
> 
> On Jun 20, 2014, at 5:17 PM, Hefty, Sean  wrote:
> 
> >> During a remote transport disconnect, the QP leaves RTS.
> >>
> >> xprtrdma deals with this in a separate transport connect worker
> >> process, where it creates a new id and qp, and replaces the existing id and
> qp.
> >>
> >> Unfortunately there are parts of xprtrdma (namely FRMR
> >> deregistration) that are not easy to serialize with this reconnect logic.
> >>
> >> Re-using the QP would mean no serialization would be needed between
> >> transport reconnect and FRMR deregistration.
> >>
> >> If QP re-use is not supported, though, it's not worth considering any
> >> further.
> >
> > It may be possible to reuse the QP, just not the rdma_cm_id without
> additional code changes.  Reuse of the rdma_cm_id may also require
> changes in the underlying IB/iWarp CMs.
> 
> Steve Wise is helping me with a particular issue where QP re-use might be
> helpful.
> 
> When an RPC/RDMA transport connection is dropped (for example, the NFS
> server crashes), xprtrdma destroys the transport's QP and creates a new one
> for the next connection.
> 
> We're not quite sure what IB_WC_WR_FLUSH_ERR means in that instance.
> Our theory is there is a gap when the old QP is destroyed:
> 
> 1. If the HW reports a successful WR completion but the QP no longer
>exists, the provider substitutes an IB_WC_WR_FLUSH_ERR completion

The QP still exists, but its state is ERROR. This state change could have
multiple causes. The WQE/RQE that caused the state transition is reported
by the h/w in the corresponding CQE. The remaining CQEs after that are
completed with FLUSH-ERROR. This means data flow cannot happen anymore, and
either the QP needs a reconnection or a fresh QP needs to be created and
reconnected.

> 
> 2. If the WR is dropped before the HW even saw it, the provider inserts
>an IB_WC_WR_FLUSH_ERR completion
> 
> So if xprtrdma is trying to submit a FAST_REG_MR WR and the completion
> gets flushed, xprtrdma has no way to know whether the rkey was bumped in
> the adapter. Thus it has no certainty which rkey to use to invalidate that
> FRMR.

If an FRMR WQE is completed in flush, it must be assumed that the request is
_incomplete_.
> 
> I was idly wondering whether re-using the QP during connection loss would
> provide a guarantee that xprtrdma would never see case 1 above.
> Then IB_WC_WR_FLUSH_ERR on a FAST_REG_MR WR would be a more
> certain indication that the HW still has the old rkey.
> 
> I suppose that xprtrdma can "hang onto" the QP without re-using it by simply
> not destroying it until all WRs scheduled on the old QP are completed. Is
> reference counting the QP the usual design pattern to deal with this case?

Why can we assume that all those FRMRs for which a flush completion is reported 
are invalid while the rest are still valid,
even if a new connection is in place after some time?

> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
> 
> 
> 


Re: how to re-use a QP for a new connection

2014-06-23 Thread Chuck Lever
Hi Sean-

On Jun 20, 2014, at 5:17 PM, Hefty, Sean  wrote:

>> During a remote transport disconnect, the QP leaves RTS.
>> 
>> xprtrdma deals with this in a separate transport connect worker process,
>> where it creates a new id and qp, and replaces the existing id and qp.
>> 
>> Unfortunately there are parts of xprtrdma (namely FRMR deregistration)
>> that are not easy to serialize with this reconnect logic.
>> 
>> Re-using the QP would mean no serialization would be needed between
>> transport reconnect and FRMR deregistration.
>> 
>> If QP re-use is not supported, though, it's not worth considering any
>> further.
> 
> It may be possible to reuse the QP, just not the rdma_cm_id without 
> additional code changes.  Reuse of the rdma_cm_id may also require changes in 
> the underlying IB/iWarp CMs.

Steve Wise is helping me with a particular issue where QP re-use might
be helpful.

When an RPC/RDMA transport connection is dropped (for example, the NFS
server crashes), xprtrdma destroys the transport's QP and creates a
new one for the next connection.

We’re not quite sure what IB_WC_WR_FLUSH_ERR means in that instance. Our
theory is there is a gap when the old QP is destroyed:

1. If the HW reports a successful WR completion but the QP no longer
   exists, the provider substitutes an IB_WC_WR_FLUSH_ERR completion

2. If the WR is dropped before the HW even saw it, the provider inserts
   an IB_WC_WR_FLUSH_ERR completion

So if xprtrdma is trying to submit a FAST_REG_MR WR and the completion
gets flushed, xprtrdma has no way to know whether the rkey was bumped in
the adapter. Thus it has no certainty which rkey to use to invalidate
that FRMR.

I was idly wondering whether re-using the QP during connection loss
would provide a guarantee that xprtrdma would never see case 1 above.
Then IB_WC_WR_FLUSH_ERR on a FAST_REG_MR WR would be a more certain
indication that the HW still has the old rkey.

I suppose that xprtrdma can “hang onto” the QP without re-using it by
simply not destroying it until all WRs scheduled on the old QP are
completed. Is reference counting the QP the usual design pattern to deal
with this case?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





PCI/AER: AER in SRIOV environment

2014-06-23 Thread Yishai Hadas

Hi Vijay,
While adding AER support for a Mellanox NIC in an SRIOV environment, I 
hit a problem during testing that led me to your patch, accepted as part 
of kernel 3.8, commit ID 
"918b4053184c0ca22236e70e299c5343eea35304".


I have some concerns/questions:
When working in an SRIOV environment, VFs may be unattached, with no 
driver assigned to them, or may be attached to a virtual machine in 
pass-through mode.
In a KVM setup, the pci-stub driver is loaded in the HYP/PF for a given 
attached VF.


I'm using the aer-inject kernel module and its corresponding aer-inject 
tool to simulate an error in the HYP.
In both cases your commit causes the AER recovery to fail, because no 
driver that supports AER is assigned to the PF's VFs. This differs from 
the behavior of the code before your change.


How should such cases work? My expectation was that the PF would get 
the error-detected message, recognize whether the issue is its own or 
belongs to one of its VFs, and act accordingly. In the current code it 
looks like recovery fails as part of the "voting" when no AER handler 
is assigned to the VFs.


Yishai


[PATCH 05/22] infiniband: Use pci_zalloc_consistent

2014-06-23 Thread Joe Perches
Remove the now unnecessary memset too.

Signed-off-by: Joe Perches 
---
 drivers/infiniband/hw/amso1100/c2.c   |  6 ++
 drivers/infiniband/hw/nes/nes_hw.c| 12 ++--
 drivers/infiniband/hw/nes/nes_verbs.c |  5 ++---
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c
index 00400c3..766a71c 100644
--- a/drivers/infiniband/hw/amso1100/c2.c
+++ b/drivers/infiniband/hw/amso1100/c2.c
@@ -604,16 +604,14 @@ static int c2_up(struct net_device *netdev)
tx_size = c2_port->tx_ring.count * sizeof(struct c2_tx_desc);
 
c2_port->mem_size = tx_size + rx_size;
-   c2_port->mem = pci_alloc_consistent(c2dev->pcidev, c2_port->mem_size,
-   &c2_port->dma);
+   c2_port->mem = pci_zalloc_consistent(c2dev->pcidev, c2_port->mem_size,
+&c2_port->dma);
if (c2_port->mem == NULL) {
pr_debug("Unable to allocate memory for "
"host descriptor rings\n");
return -ENOMEM;
}
 
-   memset(c2_port->mem, 0, c2_port->mem_size);
-
/* Create the Rx host descriptor ring */
if ((ret =
 c2_rx_ring_alloc(&c2_port->rx_ring, c2_port->mem, c2_port->dma,
diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 9020024..02120d3 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -1003,13 +1003,13 @@ int nes_init_cqp(struct nes_device *nesdev)
(sizeof(struct nes_hw_aeqe) * nesadapter->max_qp) +
sizeof(struct nes_hw_cqp_qp_context);
 
-   nesdev->cqp_vbase = pci_alloc_consistent(nesdev->pcidev, nesdev->cqp_mem_size,
-   &nesdev->cqp_pbase);
+   nesdev->cqp_vbase = pci_zalloc_consistent(nesdev->pcidev,
+ nesdev->cqp_mem_size,
+ &nesdev->cqp_pbase);
if (!nesdev->cqp_vbase) {
nes_debug(NES_DBG_INIT, "Unable to allocate memory for host descriptor rings\n");
return -ENOMEM;
}
-   memset(nesdev->cqp_vbase, 0, nesdev->cqp_mem_size);
 
/* Allocate a twice the number of CQP requests as the SQ size */
nesdev->nes_cqp_requests = kzalloc(sizeof(struct nes_cqp_request) *
@@ -1691,13 +1691,13 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev)
(NES_NIC_WQ_SIZE * 2 * sizeof(struct nes_hw_nic_cqe)) +
sizeof(struct nes_hw_nic_qp_context);
 
-   nesvnic->nic_vbase = pci_alloc_consistent(nesdev->pcidev, nesvnic->nic_mem_size,
-   &nesvnic->nic_pbase);
+   nesvnic->nic_vbase = pci_zalloc_consistent(nesdev->pcidev,
+  nesvnic->nic_mem_size,
+  &nesvnic->nic_pbase);
if (!nesvnic->nic_vbase) {
nes_debug(NES_DBG_INIT, "Unable to allocate memory for NIC host descriptor rings\n");
return -ENOMEM;
}
-   memset(nesvnic->nic_vbase, 0, nesvnic->nic_mem_size);
nes_debug(NES_DBG_INIT, "Allocated NIC QP structures at %p (phys = %016lX), size = %u.\n",
nesvnic->nic_vbase, (unsigned long)nesvnic->nic_pbase, nesvnic->nic_mem_size);
 
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 218dd35..fef067c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1616,8 +1616,8 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
entries, nescq->cq_mem_size, nescq->hw_cq.cq_number);
 
/* allocate the physical buffer space */
-   mem = pci_alloc_consistent(nesdev->pcidev, nescq->cq_mem_size,
-   &nescq->hw_cq.cq_pbase);
+   mem = pci_zalloc_consistent(nesdev->pcidev, nescq->cq_mem_size,
+   &nescq->hw_cq.cq_pbase);
if (!mem) {
printk(KERN_ERR PFX "Unable to allocate pci memory for cq\n");
nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num);
@@ -1625,7 +1625,6 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
return ERR_PTR(-ENOMEM);
}
 
-   memset(mem, 0, nescq->cq_mem_size);
nescq->hw_cq.cq_vbase = mem;
nescq->hw_cq.cq_head = 0;
nes_debug(NES_DBG_CQ, "CQ%u virtual address @ %p, phys = 0x%08X\n",
-- 
1.8.1.2.459.gbcd45b4.dirty


Re: [PATCH 05/22] infiniband: Use pci_zalloc_consistent

2014-06-23 Thread Steve Wise

For the amso1100 change...

Acked-by: Steve Wise 


[PATCH 00/22] Add and use pci_zalloc_consistent

2014-06-23 Thread Joe Perches
Adding the helper reduces object code size as well as overall
source line count.

It's also consistent with all the various zalloc mechanisms
in the kernel.

Done with a simple cocci script and some typing.
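
As a rough user-space analogy of the conversion (not the kernel helper
itself), the pattern being replaced and its replacement look like this.
Here malloc() and calloc() stand in for pci_alloc_consistent() and
pci_zalloc_consistent(); the real helper, as used in the patches, takes
the pci_dev, the size and a pointer to the DMA handle. The function
names alloc_then_clear and zalloc are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old pattern: allocate, check for failure, then clear separately. */
static void *alloc_then_clear(size_t size)
{
    void *p;

    assert(size > 0);
    p = malloc(size);
    if (p != NULL)
        memset(p, 0, size);
    return p;
}

/* New pattern: one call that returns zeroed memory or NULL. */
static void *zalloc(size_t size)
{
    assert(size > 0);
    return calloc(1, size);
}
```

Callers get the same zeroed buffer either way; the second form just
removes the separate memset() from every call site.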

Joe Perches (22):
  pci-dma-compat: Add pci_zalloc_consistent helper
  atm: Use pci_zalloc_consistent
  block: Use pci_zalloc_consistent
  crypto: Use pci_zalloc_consistent
  infiniband: Use pci_zalloc_consistent
  i810: Use pci_zalloc_consistent
  media: Use pci_zalloc_consistent
  amd: Use pci_zalloc_consistent
  atl1e: Use pci_zalloc_consistent
  enic: Use pci_zalloc_consistent
  sky2: Use pci_zalloc_consistent
  micrel: Use pci_zalloc_consistent
  qlogic: Use pci_zalloc_consistent
  irda: Use pci_zalloc_consistent
  ipw2100: Use pci_zalloc_consistent
  mwl8k: Use pci_zalloc_consistent
  rtl818x: Use pci_zalloc_consistent
  rtlwifi: Use pci_zalloc_consistent
  scsi: Use pci_zalloc_consistent
  staging: Use pci_zalloc_consistent
  synclink_gt: Use pci_zalloc_consistent
  vme: bridges: Use pci_zalloc_consistent

 drivers/atm/he.c   | 31 -
 drivers/atm/idt77252.c | 15 
 drivers/block/DAC960.c | 18 +-
 drivers/block/cciss.c  | 11 +++---
 drivers/block/skd_main.c   | 25 +-
 drivers/crypto/hifn_795x.c |  5 ++-
 drivers/gpu/drm/i810/i810_dma.c|  5 ++-
 drivers/infiniband/hw/amso1100/c2.c|  6 ++--
 drivers/infiniband/hw/nes/nes_hw.c | 12 +++
 drivers/infiniband/hw/nes/nes_verbs.c  |  5 ++-
 drivers/media/common/saa7146/saa7146_core.c| 15 
 drivers/media/common/saa7146/saa7146_fops.c|  5 +--
 drivers/media/pci/bt8xx/bt878.c| 16 +++--
 drivers/media/pci/ngene/ngene-core.c   |  7 ++--
 drivers/media/usb/ttusb-budget/dvb-ttusb-budget.c  | 11 ++
 drivers/media/usb/ttusb-dec/ttusb_dec.c| 11 ++
 drivers/net/ethernet/amd/pcnet32.c | 16 -
 drivers/net/ethernet/atheros/atl1e/atl1e_main.c|  7 ++--
 drivers/net/ethernet/cisco/enic/vnic_dev.c |  8 ++---
 drivers/net/ethernet/marvell/sky2.c|  5 ++-
 drivers/net/ethernet/micrel/ksz884x.c  |  7 ++--
 .../net/ethernet/qlogic/netxen/netxen_nic_ctx.c|  4 +--
 drivers/net/ethernet/qlogic/qlge/qlge_main.c   | 11 +++---
 drivers/net/irda/vlsi_ir.c |  4 +--
 drivers/net/wireless/ipw2x00/ipw2100.c | 16 +++--
 drivers/net/wireless/mwl8k.c   |  6 ++--
 drivers/net/wireless/rtl818x/rtl8180/dev.c | 11 +++---
 drivers/net/wireless/rtlwifi/pci.c | 17 +++--
 drivers/scsi/3w-sas.c  |  5 ++-
 drivers/scsi/a100u2w.c |  8 ++---
 drivers/scsi/be2iscsi/be_main.c| 10 +++---
 drivers/scsi/be2iscsi/be_mgmt.c|  3 +-
 drivers/scsi/csiostor/csio_wr.c|  8 +
 drivers/scsi/eata.c|  5 ++-
 drivers/scsi/hpsa.c|  8 ++---
 drivers/scsi/megaraid/megaraid_mbox.c  | 16 -
 drivers/scsi/megaraid/megaraid_sas_base.c  |  8 ++---
 drivers/scsi/mesh.c|  6 ++--
 drivers/scsi/mvumi.c   |  9 ++---
 drivers/scsi/pm8001/pm8001_sas.c   |  5 ++-
 drivers/staging/rtl8192e/rtl8192e/rtl_core.c   | 15 +++-
 drivers/staging/rtl8192ee/pci.c| 37 +++-
 drivers/staging/rtl8821ae/pci.c| 36 +++
 drivers/staging/slicoss/slicoss.c  |  9 ++---
 drivers/staging/vt6655/device_main.c   | 40 +++---
 drivers/tty/synclink_gt.c  |  5 ++-
 drivers/vme/bridges/vme_ca91cx42.c |  6 ++--
 drivers/vme/bridges/vme_tsi148.c   |  6 ++--
 include/asm-generic/pci-dma-compat.h   |  8 +
 49 files changed, 209 insertions(+), 354 deletions(-)

-- 
1.8.1.2.459.gbcd45b4.dirty



[PATCH 1/2] iw_cxgb4: Fix skb_leak in reject_cr

2014-06-23 Thread Hariprasad Shenai
Based on original work by Steve Wise

Signed-off-by: Steve Wise 
Signed-off-by: Hariprasad Shenai 
---
 drivers/infiniband/hw/cxgb4/cm.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 5e153f6..cc36e9b 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -2180,7 +2180,6 @@ static void reject_cr(struct c4iw_dev *dev, u32 hwtid, struct sk_buff *skb)
PDBG("%s c4iw_dev %p tid %u\n", __func__, dev, hwtid);
BUG_ON(skb_cloned(skb));
skb_trim(skb, sizeof(struct cpl_tid_release));
-   skb_get(skb);
release_tid(&dev->rdev, hwtid, skb);
return;
 }
-- 
1.7.1



[PATCH 2/2] iw_cxgb4: clean up connection on arp error

2014-06-23 Thread Hariprasad Shenai
Based on original work by Steve Wise

Signed-off-by: Steve Wise 
Signed-off-by: Hariprasad Shenai 
---
 drivers/infiniband/hw/cxgb4/cm.c |   11 ++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index cc36e9b..6a93280 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -432,8 +432,17 @@ static void arp_failure_discard(void *handle, struct sk_buff *skb)
  */
 static void act_open_req_arp_failure(void *handle, struct sk_buff *skb)
 {
+   struct c4iw_ep *ep = handle;
+
printk(KERN_ERR MOD "ARP failure duing connect\n");
kfree_skb(skb);
+   connect_reply_upcall(ep, -EHOSTUNREACH);
+   state_set(&ep->com, DEAD);
+   remove_handle(ep->com.dev, &ep->com.dev->atid_idr, ep->atid);
+   cxgb4_free_atid(ep->com.dev->rdev.lldi.tids, ep->atid);
+   dst_release(ep->dst);
+   cxgb4_l2t_release(ep->l2t);
+   c4iw_put_ep(&ep->com);
 }
 
 /*
@@ -658,7 +667,7 @@ static int send_connect(struct c4iw_ep *ep)
opt2 |= T5_OPT_2_VALID;
opt2 |= V_CONG_CNTRL(CONG_ALG_TAHOE);
}
-   t4_set_arp_err_handler(skb, NULL, act_open_req_arp_failure);
+   t4_set_arp_err_handler(skb, ep, act_open_req_arp_failure);
 
if (is_t4(ep->com.dev->rdev.lldi.adapter_type)) {
if (ep->com.remote_addr.ss_family == AF_INET) {
-- 
1.7.1



[PATCH 0/2] Fixes skb leak and connection clean up on ARP error

2014-06-23 Thread Hariprasad Shenai
Hi,

This patch series fixes an skb leak and the connection cleanup on ARP error
for the iw_cxgb4 driver.
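
For readers unfamiliar with this kind of leak, here is a generic
reference-counting illustration in user-space C. It assumes the callee
consumes the caller's reference, so an extra "get" before the call is
never balanced. The buf type and all function names are hypothetical
stand-ins and this is not the driver code.

```c
#include <assert.h>

/* Hypothetical stand-in for a reference-counted buffer such as an skb. */
struct buf {
    int refs;
    int freed;
};

static void buf_get(struct buf *b)
{
    assert(b->refs > 0);
    b->refs++;
}

static void buf_put(struct buf *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0)
        b->freed = 1;
}

/* Stand-in for a send path that consumes the caller's reference. */
static void consume(struct buf *b)
{
    buf_put(b);
}

/* Leaky variant: the extra reference taken here is never dropped. */
static void reject_with_extra_get(struct buf *b)
{
    buf_get(b);
    consume(b);
}

/* Fixed variant: hand over the only reference and let it be freed. */
static void reject_fixed(struct buf *b)
{
    consume(b);
}
```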

This patch series is created on top of linux-next tree. We would like to
request this patch series to get merged via Roland's infiniband tree master
branch.

We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.

Thanks

Hariprasad Shenai (2):
  iw_cxgb4: Fix skb_leak in reject_cr
  iw_cxgb4: clean up connection on arp error

 drivers/infiniband/hw/cxgb4/cm.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)



Re: mlx4 - query regarding PF VF functionality division

2014-06-23 Thread Or Gerlitz
On Mon, Jun 23, 2014 at 12:33 PM, Bob Biloxi  wrote:
[...]
> Is there any way we can clearly separate the files that are used by PF
> vs the files that are used by VF in the (drivers/net/ethernet/mlx4
> sub-directory)?
[...]

Not really, but let's take EIM approach, what's your goal/mission?

Or.


mlx4 - query regarding PF VF functionality division

2014-06-23 Thread Bob Biloxi
Hi All,

I was going through the Mellanox driver (mlx4) and had
difficulty understanding which portion of the code is
executed by the PF (Physical Function) driver and which portion
by the VF (Virtual Function) driver in SRIOV mode.

My confusion is because, I was of the understanding that the QPs, CQs
(and their creation, state mgmt commands) etc are to be performed by
the virtual function driver(VF driver).


And the role of the physical function driver(PF driver) is to just
take care of the resource_tracker.c and ICM allocation.


But of late, I think I may have understood wrong. This is because
there is code that is specifically executed when mlx4_is_master is
true/false( indicating PF or VF).

And then, there is code which is not surrounded by this test, which
indicates it is executed in both cases(PF driver as well as VF
driver).

Is my understanding correct? If yes, is the QP, CQ and
ethernet tx/rx related functionality executed by both master and
slave?

Is there any way we can clearly separate the files that are used by PF
vs the files that are used by VF in the (drivers/net/ethernet/mlx4
sub-directory)?

I would be really thankful and really appreciate all the
help/clarification I can get in understanding this.


Thank you so much.


Best Regards,
Bob