NFSoRDMA bi-weekly developers meeting minutes (5/28/15)

2015-05-28 Thread Shirley Ma
Attendees:

Jeff Becker (NASA)
Yan Burman (Mellanox)
Chuck Lever (Oracle)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Shirley Ma (Oracle)
Anna Schumaker (Net App)
Devesh Sharma (Avago Tech)

Today's meeting notes:

NFSoRDMA deployment (Jeff)
---
NASA is interested in deploying NFS/RDMA; it's on the table.

NFSoRDMA RFC update (Chuck)
---
Updating the IETF specifications for RPC/RDMA (RFC 5666) and NFS/RDMA (RFC 5667) to 
close the holes that might cause confusion. We’d like to replace RFC 5666 with 
a new document that fixes problems with the RPC/RDMA version 1 protocol; and 
replace RFC 5667 with a new document that fills in missing or
incorrect items with regard to NFSv4.0 and NFSv4.1 (and pNFS). There will be no 
protocol changes that would break backwards compatibility.

Chuck has published an experimental draft describing the conventions needed for 
bi-directional RPC/RDMA:
https://datatracker.ietf.org/doc/draft-cel-nfsv4-rpcrdma-bidirection/

NFSoRDMA bi-directional implementation (Chuck)
---
Chuck has done an experimental implementation, which has been tested among the 
Linux client, Linux server, Solaris client, and Solaris server. NFS/RDMA 
bi-directional support is targeted for Linux 4.3 or 4.4, which should be 
available roughly six months from now.

Wireshark dissectors support RPC/RDMA, NFS/RDMA (Yan)
---
Yan and his team have been working on Wireshark dissectors to support RPC/RDMA 
and NFS/RDMA. The patches will be submitted for review next week. This is good 
news for debugging NFS/RDMA problems.

NFS/RDMA client and server shutdown issue (Devesh, Shirley)
-
Any outstanding mounts will prevent both the client and the server from 
shutting down properly. The right fix is to handle CM_DEVICE_REMOVAL properly 
so the allocated resources can be cleaned up gracefully. We can suggest 
unmounting before shutting down the client to avoid this issue, but on the 
server side we need to add a handler for outstanding cm_ids, which is not 
present in the current implementation.
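For illustration only, handling device removal means everything tied to the 
departing device, including the cm_id itself, has to be released before the CM 
can finish the removal; a minimal sketch (the my_conn structure and helpers are 
hypothetical, not the xprtrdma or svcrdma code) might look like:

#include <rdma/rdma_cm.h>
#include <linux/workqueue.h>

/* Hypothetical per-connection state; INIT_WORK() is done at setup time. */
struct my_conn {
	struct rdma_cm_id	*cm_id;
	struct work_struct	remove_work;
};

static void my_device_removal_worker(struct work_struct *work)
{
	struct my_conn *conn = container_of(work, struct my_conn, remove_work);

	/* Quiesce I/O, then release the verbs resources tied to the device:
	 * drain/destroy the QP and CQs, deregister MRs, and so on. */
	rdma_destroy_id(conn->cm_id);	/* rdma_cm waits for this before
					 * letting the device go away */
}

static int my_cm_event_handler(struct rdma_cm_id *id,
			       struct rdma_cm_event *event)
{
	struct my_conn *conn = id->context;

	switch (event->event) {
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/* Event handler context: defer the teardown to a work item. */
		schedule_work(&conn->remove_work);
		return 0;
	default:
		return 0;
	}
}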

NFSD workqueue mode vs. threading mode performance impacts: (Shirley)
-
Shirley has spent some time testing, fixing, and evaluating the NFSD workqueue 
mode originally implemented by Jeff Layton. The workqueue mode benefits some 
workloads while the global threading mode benefits others; in general, the 
workqueue mode is better than a per-CPU threading mode. The workqueue is a 
per-CPU workqueue, while the global threading mode is not per-CPU based. There 
might be some room to make the NFSD workqueue no worse than the global 
threading mode, but that requires more in-depth performance investigation.
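For reference, the distinction being measured is roughly a bound (per-CPU) 
workqueue versus an unbound pool; a minimal sketch, with made-up names (svc_wq, 
svc_process_work) rather than Jeff Layton's actual patchset:

#include <linux/workqueue.h>

static struct workqueue_struct *svc_wq;

static void svc_process_work(struct work_struct *work)
{
	/* decode and process one RPC request (details omitted) */
}

static int svc_wq_init(void)
{
	/*
	 * Without WQ_UNBOUND this is a bound (per-CPU) workqueue: an item
	 * runs on the CPU it was queued on, which keeps cache locality but
	 * can pile work up behind one busy CPU.  Adding WQ_UNBOUND behaves
	 * more like the global thread pool: any idle worker may take the
	 * item, at some cost in locality.
	 */
	svc_wq = alloc_workqueue("nfsd", WQ_HIGHPRI | WQ_CPU_INTENSIVE, 0);
	return svc_wq ? 0 : -ENOMEM;
}

static void svc_queue_request(struct work_struct *w)
{
	INIT_WORK(w, svc_process_work);
	queue_work(svc_wq, w);	/* queues on the local CPU for a bound wq */
}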

Feel free to reply here for anything missing or incorrect. Thanks everyone for 
joining the call and providing valuable inputs/work to the community to make 
NFSoRDMA better.

Cheers,
Shirley


NFSoRDMA bi-weekly developers meeting minutes (4/16/15)

2015-04-16 Thread Shirley Ma
Attendees:


Jeff Becker (NASA)
Wendy Cheng (Intel)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Today's meeting notes:

NFSoRDMA server support:
---
The NFSoRDMA server is still a technology preview in RHEL 7.2. Since customers 
are looking for Linux NFSoRDMA server support, it may be possible to fully 
enable NFSoRDMA server support in the future. Jeff from NASA has plans to 
deploy Linux NFSoRDMA; the Mellanox OFED stack release might not support it, so 
moving to upstream OFED 3.18 is an option. Their fabric is InfiniBand.

NFSoRDMA test plan (Doug, Wendy, Chuck)
--
Doug talked about future NFSoRDMA test infrastructure at Red Hat. He has 
completed SCSI disk driver characterization, where the BW is around 3.9GB/s; 
next is the file system, then NFSoRDMA and storage RDMA. Performance will be 
investigated in both the HW and the software stack to see where to optimize. He 
is open to input on which particular file systems to test, so people can 
suggest the file systems they are interested in. Chuck suggested that the 
NFSoRDMA performance work should include a tmpfs test, since a persistent disk 
is too slow for evaluating NFSoRDMA transport-layer performance. The fabric is 
a mix of DDR, QDR, and EDR.
  
Wendy also talked about the possibility of NFSoRDMA testing at Intel. Currently 
they don't include NFSoRDMA in the SPECNFS test, and she asked whether it's 
possible to include it. Chuck replied that the test client doesn't support 
NFSoRDMA. He will talk to the maintainer to see whether NFSoRDMA can be 
supported, but there needs to be a good reason to add it. Intel's test setup is 
RHEL 7.1; if it enables the NFSoRDMA test, the kernel should move to 3.18.

NFS bake-a-thon planning: (Doug, Chuck)
-
Chuck and Doug talked about the upcoming bake-a-thon plan. Doug has a small IB 
test setup that he can bring to the test. Solaris requires particular HCAs to 
work. They will talk to SteveD about several different possibilities to make it 
work.

NFSoRDMA performance: (Shirley, Chuck, SteveW)
-
From the ftrace function-graph report, both the large-I/O workload and the 
CPU-intensive workload showed that interrupts and scheduling are two key 
contributors to the cost of NFS operations (READ, WRITE, GETATTR...). In a 
CPU-intensive workload, such as a kernel build in the background, wake_up_bit 
is too costly. Is it possible to replace wake_up_bit? Chuck mentioned that it 
used to be a wait queue / wake-up queue; we could try that to see whether there 
is any performance difference. To reduce interrupts, Shirley is trying to 
measure the performance gain from disabling interrupts and waiting a bit; there 
is some throughput gain under heavy CPU utilization. She will continue to 
investigate. Chuck has tried to move the completion upcall from CQ (interrupt) 
context to workqueue context. There were no more items queuing in the receive 
queue; he only saw one or two. SteveW replied that a queue should build up in 
this scenario. Shirley thinks that without addressing the scheduling issue in 
this workload, it might not be possible to see any other improvement.
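A minimal sketch of what moving the completion upcall out of interrupt context 
and into a workqueue can look like (my_ep, xprt_recv_wq, and the worker are 
hypothetical names; the real xprtrdma code differs in detail):

#include <rdma/ib_verbs.h>
#include <linux/workqueue.h>

static struct workqueue_struct *xprt_recv_wq;

/* Hypothetical endpoint fragment; rep_work is INIT_WORK()'d at setup time. */
struct my_ep {
	struct ib_cq		*recv_cq;
	struct work_struct	rep_work;
};

static void my_recvcq_worker(struct work_struct *work)
{
	struct my_ep *ep = container_of(work, struct my_ep, rep_work);
	struct ib_wc wc;

	/* Drain completions in process context rather than in the interrupt. */
	while (ib_poll_cq(ep->recv_cq, 1, &wc) > 0)
		; /* hand wc off to RPC reply handling here */

	/* Re-arm; a completion that raced in is picked up on the next upcall. */
	ib_req_notify_cq(ep->recv_cq, IB_CQ_NEXT_COMP);
}

static void my_recvcq_upcall(struct ib_cq *cq, void *cq_context)
{
	struct my_ep *ep = cq_context;

	/* Interrupt context: just kick the worker. */
	queue_work(xprt_recv_wq, &ep->rep_work);
}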

Doug is working on performance evaluation in his large system test bed. He will 
establish a more accurate performance baseline, then investigate where to 
optimize in the stack as well as in the HW.

We are looking for good tools to simulate real workloads, such as a database 
workload. Right now we are using fio to simulate them. We need to define real 
workloads; if anybody has a suggestion, that would be great. Shirley will send 
a separate email thread to discuss it.

Storage RDMA technology (Shirley, Chuck, Doug)
---
Can we use this call to cover iSCSI/iSER/SRP work? The performance issues will 
be different; however, both use RDMA as the transport, so it might be helpful 
to discuss storage RDMA work here, as long as NFSoRDMA remains the focus and we 
have time to discuss that work.


Feel free to reply here for anything missing or incorrect. Thanks everyone for 
joining the call and providing valuable inputs/work to the community to make 
NFSoRDMA better.

Cheers,
Shirley






NFSoRDMA bi-weekly developers meeting minutes (4/2)

2015-04-02 Thread Shirley Ma
Attendees:

Yan Burman (Mellanox)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

Today's meeting notes:
Sorry for the late start. The meeting time has changed from 7:30am PST to 
8:30am PST, but I didn't notice that the change wasn't sent out. :( Thanks for 
your patience in joining the call one hour later.

1. NFSoRDMA deployment:
NFSoRDMA has much better performance than NFS over IPoIB-CM in general. People 
are looking for both Linux NFSoRDMA client and server support for deployment; 
however, distros only support the NFSoRDMA client at this moment. Developers 
have been fixing bugs on the server side, and more dedicated resources for 
NFSoRDMA server development are needed for stability and performance work. We 
will continue to improve NFSoRDMA performance and try to find more resources 
for the server side.

2. NFSoRDMA performance
After experimenting with different approaches (multiple QPs, a different 
completion vector per QP), we think we should focus on single-QP scalability 
first. Right now small-I/O single-QP IOPS is around 100K, and large-I/O 
single-QP NFS READ can reach 3.6GB/s (which almost reaches the link speed of 
the fabric). To identify the single-QP scalability limits, here is a list of 
things we can try:
-- Generic RPC dispatching: identify serialized operations and parallelize them
-- Scheduling mechanism: wait_on_bit latency, queue-work latency
-- RDMA transport layer: hack poll_cq on both client and server to poll longer 
(for example, wait roughly one RPC RTT, or wait for more WCs so that more 
completions are processed per pass) to reduce interrupt/wait overhead and see 
whether the results improve; see the sketch below
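As a rough illustration of that last item, a bounded poll-longer loop might 
look like the sketch below (BUDGET and POLL_US are made-up knobs, not values 
from the actual client or server code):

#include <rdma/ib_verbs.h>
#include <linux/delay.h>

#define BUDGET	8	/* WCs reaped per ib_poll_cq() call */
#define POLL_US	5	/* how long to linger waiting for more completions */

static int poll_cq_batched(struct ib_cq *cq)
{
	struct ib_wc wcs[BUDGET];
	int total = 0, spins = 3;

	for (;;) {
		int n = ib_poll_cq(cq, BUDGET, wcs);

		if (n > 0) {
			/* process wcs[0..n-1] here */
			total += n;
			continue;
		}
		if (spins-- <= 0)
			break;
		udelay(POLL_US);	/* linger briefly instead of re-arming */
	}
	return total;
}

The idea is simply to amortize one interrupt (and one wakeup) over several 
completions; whether the extra polling time is worth it depends on the RPC RTT.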

3. We will cover iSCSI/iSER/SRP in the future discussions.

10/23/2014
@8:30am PT DST
@9:30am MT DST
@10:30am CT DST
@11:30am ET DST
@Bangalore @9:00pm
@Israel @6:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Shirley


Re: IB_CQ_VECTOR_LEAST_ATTACHED

2014-12-07 Thread Shirley Ma


On 12/07/2014 12:08 PM, Chuck Lever wrote:
 
 On Dec 7, 2014, at 5:20 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:
 
 On 12/4/2014 9:41 PM, Shirley Ma wrote:
 On 12/04/2014 10:43 AM, Bart Van Assche wrote:
 On 12/04/14 17:47, Shirley Ma wrote:
 What's the history of this patch?
 http://lists.openfabrics.org/pipermail/general/2008-May/050813.html

 I am working on multiple QPs workload. And I created a similar approach
 with IB_CQ_VECTOR_LEAST_ATTACHED, which can bring up about 17% small I/O
 performance. I think this CQ_VECTOR loading balance should be maintained
 in provider not the caller. I didn't see this patch was submitted to
 mainline kernel, wonder any reason behind?

 My interpretation is that an approach similar to 
 IB_CQ_VECTOR_LEAST_ATTACHED is useful on single-socket systems but 
 suboptimal on multi-socket systems. Hence the code for associating CQ sets 
 with CPU sockets in the SRP initiator. These changes have been queued for 
 kernel 3.19. See also branch drivers-for-3.19 in git repo 
 git://git.infradead.org/users/hch/scsi-queue.git.

 What I did is that I manually controlled IRQ and working thread on the same 
 socket. The CQ is created when mounting the file system in NFS/RDMA, but 
 the workload thread might start from different socket, so per-cpu based 
 implementation might not apply. I will look at SRP implementation.


 Hey Shirley,

 Bart is correct, in general the LEAST_ATTACHED approach might not be
 optimal in the NUMA case. The thread - QP/CQ/CPU assignment is
 addressed by the multi-channel approach which to my understanding won't
 be implemented in NFSoRDMA in the near future (right Chuck?)
 
 As I understand it, the preference of the Linux NFS community is that
 any multi-pathing solution should be transparent to the ULP (NFS and
 RPC, in this case). mp-tcp is ideal in that the ULP is presented with
 a single virtual transport instance, but under the covers, that instance
 can be backed by multiple active paths.
 
 Alternately, pNFS can be deployed. This allows a dataset to be striped
 across multiple servers (and networks). There is a rather high bar to
 entering this arena however.
 
 Speculating aloud, multiple QPs per transport instance may require
 implementation changes on the server as well as the client. Any
 interoperability dependencies should be documented via a standards
 process.
 
 And note that an RPC transport (at least in kernel) is shared across
 many user applications and mount points. I find it difficult to visualize
 an intuitive and comprehensive administrative interface where enough
 guidance is provided to place a set of NFS applications and an RPC
 transport in the same resource domain (maybe cgroups?).
 
 So for the time being I prefer staying with a single QP per client-
 server pair.
 
 A large NFS client can actively use many NFS servers, however. Each
 client-server pair would benefit from finding least-used resources
 when QP and CQs are created. That is something we can leverage today.
 

Yes, that's something I am evaluating now for one NFS client talking to 
different destination servers. I can see more than a 15% BW increase when 
simulating multiple server mount points through different IPoIB child 
interfaces and changing the create_cq completion vector from 0 to least-used, 
so the completion vectors are balanced among QPs.
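A minimal sketch of least-used completion-vector selection, using the 3.x-era 
ib_create_cq() signature (the vector_load bookkeeping and the helper names are 
hypothetical, not the patch under discussion):

#include <rdma/ib_verbs.h>
#include <linux/atomic.h>

/* Hypothetical per-device usage counters, assuming <= 64 completion vectors. */
static atomic_t vector_load[64];

static int pick_least_used_vector(struct ib_device *dev)
{
	int v, best = 0;

	for (v = 1; v < dev->num_comp_vectors; v++)
		if (atomic_read(&vector_load[v]) < atomic_read(&vector_load[best]))
			best = v;
	atomic_inc(&vector_load[best]);
	return best;
}

static struct ib_cq *create_balanced_cq(struct ib_device *dev,
					ib_comp_handler handler,
					void *ctx, int cqe)
{
	int vec = pick_least_used_vector(dev);

	/* The final argument selects which completion vector (MSI-X
	 * interrupt) this CQ's events are delivered on. */
	return ib_create_cq(dev, handler, NULL, ctx, cqe, vec);
}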

 However, the LEAST_ATTACH vector hint will revive again in the future
 as there is a need to spread applications on different interrupt
 vectors (especially for user-space).

 CC'ing Matan who is working on this, perhaps he can comment on this as
 well.
 


IB_CQ_VECTOR_LEAST_ATTACHED

2014-12-04 Thread Shirley Ma
Hello Or, Eli,

What's the history of this patch?
http://lists.openfabrics.org/pipermail/general/2008-May/050813.html

I am working on a multiple-QP workload, and I created an approach similar to 
IB_CQ_VECTOR_LEAST_ATTACHED, which brings about a 17% small-I/O performance 
improvement. I think this CQ_VECTOR load balancing should be maintained in the 
provider, not the caller. I didn't see this patch submitted to the mainline 
kernel; is there any reason behind that?

Thanks
Shirley 


Re: IB_CQ_VECTOR_LEAST_ATTACHED

2014-12-04 Thread Shirley Ma
On 12/04/2014 10:43 AM, Bart Van Assche wrote:
 On 12/04/14 17:47, Shirley Ma wrote:
 What's the history of this patch?
 http://lists.openfabrics.org/pipermail/general/2008-May/050813.html

 I am working on multiple QPs workload. And I created a similar approach
 with IB_CQ_VECTOR_LEAST_ATTACHED, which can bring up about 17% small I/O
 performance. I think this CQ_VECTOR loading balance should be maintained
 in provider not the caller. I didn't see this patch was submitted to
 mainline kernel, wonder any reason behind?
 
 My interpretation is that an approach similar to IB_CQ_VECTOR_LEAST_ATTACHED 
 is useful on single-socket systems but suboptimal on multi-socket systems. 
 Hence the code for associating CQ sets with CPU sockets in the SRP initiator. 
 These changes have been queued for kernel 3.19. See also branch 
 drivers-for-3.19 in git repo git://git.infradead.org/users/hch/scsi-queue.git.

What I did was manually pin the IRQ and the working thread to the same socket. 
The CQ is created when the NFS/RDMA file system is mounted, but the workload 
thread might start on a different socket, so a per-CPU based implementation 
might not apply. I will look at the SRP implementation.

Thanks,
Shirley
 


NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-20 Thread Shirley Ma
Attendees:

Jeff Becker (NASA)
Yan Burman (Mellanox)
Wendy Cheng (Intel)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes:

NFSoRDMA performance:
-
Even though NFSoRDMA performance seems better than IPoIB-CM, the gap between 
what the IB protocol can provide and what NFS (RDMA or IPoIB-CM) can achieve is 
still big at small I/O block sizes (focused on an 8K I/O size for database 
workloads). Even at large I/O block sizes (128K and above), NFS performance is 
not comparable to an RDMA microbenchmark. We are focusing our effort on 
figuring out the root cause. Several experimental methods have been used to try 
to improve NFSoRDMA performance.

Yan saw the NFS server issue RDMA transfers for small packet sizes (less than 
100 bytes) where it should have used post_send instead.

1. performance experimental investigation: (Shirley, Chuck, Yan)
-- Multiple QP support:
Created multiple subnets with different partition keys and different NFS client 
mount points to stretch single-link performance; iozone multi-threaded 8K DIO 
showed around a 17% improvement, still a big gap to link speed.

-- Completion vector load balancing:
Splitting send queue and completion queue interrupts across different CPUs did 
not help performance, so a patch was created to distribute interrupts among the 
available CPUs for different QPs (send and recv completions share the same 
completion vector); iozone multi-threaded 8K DIO showed a 10% performance 
improvement.

Yan shared iser performance enhancement ideas:
-- batch recv packet processing
-- batch completion processing, not signaling every completion
-- per CPU connection, cq
iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS. 32K could 
reach 1.8 million IOPS
 
-- increasing the RPC credit limit from 32 to 64
iozone 8K DIO results don't show any gain, which might indicate that we need to 
look at the general NFS I/O stack.

-- increasing work queue priority to reduce latency
NFS uses a workqueue rather than a tasklet since it runs in a can-sleep 
context; changing the workqueue flags to WQ_HIGHPRI | WQ_CPU_INTENSIVE did help 
reduce latency when the system is under heavy workload.

-- lock contention
perf top does show lock contention in the top-five list for both the NFS client 
and the NFS server. A more fine-grained lock-contention investigation is needed.

-- scheduling latency
I/O scheduling was developed for high-latency devices; there might be some room 
for improvement in I/O scheduling.

-- wsize, rsize
Chuck is looking at raising wsize and rsize to 1MB.

2. performance analysis tools to use: 
-- perf, lockstat, ftrace, mountstats, nfsiostats... 

3. performance test tools:
-- iozone, fio
-- direct IO, cached IO

Next step for performance analysis:
1. Shirley will collect performance data at the NFS I/O layer to see whether 
there are any bottlenecks there.
2. Someone needs to look at the NFS server for the small RDMA message sizes Yan 
has seen.

Feel free to reply here for anything missing. See you 12/4.

12/04/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Cheers,
Shirley


Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-20 Thread Shirley Ma

On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
 -Original Message-
 From: Shirley Ma [mailto:shirley...@oracle.com]
 Sent: Thursday, November 20, 2014 10:24 AM

 
 iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS. 32K could
 reach 1.8 million IOPS

 
 How did the ISER data get measured ? Was the measure done on ISER layer, 
 block layer, or filesystem layer ?

Here is the link on iser how to set up and measure performance:
http://community.mellanox.com/docs/DOC-1483


NFSoRDMA bi-weekly meeting minutes (11/6)

2014-11-06 Thread Shirley Ma
Attendees:

Jeff Becker (NASA)
Wendy Cheng (Intel)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Yan Burman (Mellanox) missed the call because of the daylight saving time change. :(

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes:
1. OFA Interop event (Rupert)
The Interop event went pretty well. The tests covered IB, RoCE, and iWARP with 
different vendors' HW and the upstream/OFED stack. NFSoRDMA over IB was 
included in this test event; however, NFSoRDMA over RoCE couldn't be tested 
since the modules were not in the stack yet. The detailed report will come in a 
few weeks.

2. Upstream bugs: (Chuck, Anna, Shirley)
The 3.17 kernel has a bug in tearing down the connection; this bug was hit 
consistently with multiple EQs enabled in xprtrdma when Shirley ran a fio 
multi-threaded random read/write workload. Chuck has a nice patch for this bug, 
and Shirley has validated the fix by stressing it with fio overnight. Anna will 
check on the possibility of pushing it to the stable tree since it blocks 
multi-threaded NFSoRDMA workloads. Here is the link to the bug report:
https://bugzilla.linux-nfs.org/show_bug.cgi?id=276

3. Performance test and analyze tools: (Sachin, Chuck, Wendy, Shirley, SteveW)
Discussed several tools for analyzing NFSoRDMA performance, for both latency 
and bandwidth:
-- SystemTap: Sachin has started to look at how to use SystemTap; it requires 
some time to study the tool and create probe scripts for the NFS, RPC, and 
xprtrdma layers.
-- Ftrace: enabling trace modules and functions to report execution-flow latency.
-- perf: reports execution-flow/API latency and CPU usage.
-- /proc/self/mountstats: reports total execution time, RTT, and wait time for 
each RPC. The execution-time latency comes from wake-up and wait, which depends 
on how busy the system is. The RPC RTT latency itself is reasonable.

NFSoRDMA performance depends on both the implementation and the protocol. We 
don't yet know how much of the performance gap comes from the implementation 
versus the protocol. RPC seems slow; pNFS might have better performance since 
it supports multiple queue pairs. Chuck will increase the RPC credit limit to 
see how much performance is gained from that. Our goal is to look at the 
implementation issues first, then the protocols.

Feel free to reply here for anything missing or incorrect. See you on Nov.20th.

10/23/2014
@7:30am PT
@8:30am MT
@9:30am CT
@10:30am ET
@Bangalore @9:00pm
@Israel @6:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Cheers,
Shirley


NFSoRDMA developers bi-weekly meeting minutes (10/23)

2014-10-23 Thread Shirley Ma
Attendees:

Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes:
1. OFED update and bug status (Rupert):
-- Intel has discovered some issues with infinipath-psm and is working on an 
update, so there will have to be an OFED 3.12-1 rc4.
-- Vlad (Mellanox) has agreed to put together the next version of OFED, which 
will be based on kernel 3.18 rc1. This will be ready to use in the OFA Interop 
Debug event next week, which will allow us to test some of the outstanding 
NFSoRDMA bugs.

2. NFS 4.1 RDMA client support (Chuck)
Chuck has submitted the patchset for upstream review. The patchset includes 
bi-directional RPC xprt support and sidecar client support; the default is TCP 
for handling the backchannel if not specified in the mount options. The 
patchset is under review; here is the link:
http://www.spinics.net/lists/linux-nfs/msg47278.html
-- NFS 4.1 enables pNFS
-- bi-directional RDMA

3. Performance tools and test I/O block size:
-- RPC GETATTR, LOOKUP, ACCESS, READ, and WRITE latency is more important than 
that of other RPCs: mountstats output
-- I/O latency changes under heavy CPU workload (like a kernel build)
-- direct I/O performance; tmpfs, ramdisk, big file size equal to 80% of 
physical memory
-- 8K block size performance, in particular for database workloads
-- mixed block size performance
-- benchmark tools: fio, iozone, dbench, connectathon, xfstest...
-- scalability: number of mount points, number of clients

4. RDMA emulated driver for testing if no HW is available: (Steve Wise)
-- soft-iWARP: the repo is 
https://www.gitorious.org/softiwarp/ (maintainer Bernard Metzler)

5. Can we use RDMA Write for both NFS write and read to improve performance?

Feel free to reply here for anything missing. See you on Nov.6th.

10/23/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Cheers,
Shirley


NFSoRDMA developers bi-weekly meeting minutes (10/9)

2014-10-09 Thread Shirley Ma
Attendees:

Yan Burman (Mellanox)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Sachin Prabhu (RedHat)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes: (Chuck, SteveD, Yan)
1. Follow-ups on engineering resource allocation to speed up the IB stack 
review process: in progress.

2. OFED update and bug status (Rupert):
OFED 3.12-1-RC3 is expected to be released next Monday after an update for 
infinipath-psm, which prevented the RHEL 7.0 build.
Thanks to Steve Wise for resolving some NFSoRDMA issues. Two outstanding bugs:

http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2489
Bug 2489 - System crash during cable pull test with Active NFS-RDMA share
Bug 2489 is outstanding but is resolved in 3.17 RC6; we need to bisect the 
right patch for OFED 3.12-1.

http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2507
The panic stack suggests it's a backport patch issue. To confirm, Steve Wise 
suggested reproducing it with the upstream 3.12 kernel. Devesh will build a 
3.12 kernel and test it.
 
http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2502
Bug 2502 is an RDS bug; they will talk to Oracle directly.

3. mainline kernel update and bug status: (Chuck, SteveW, Devesh)

https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Bug 269 xfstests generic/113 on NFSv4.1 causes connection drops
It was found on ConnectX-2. The number of outstanding completions is limited to 
87; when this is exceeded, post_send fails. The SQ depth is 256. Is this a 
limitation of ConnectX-2? It's better to try different HCAs to see the 
difference.
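As a side note, one quick way to compare HCAs on this point is to dump the 
device limits that bound SQ and CQ sizing; a small sketch using the 3.x-era 
ib_query_device() (dump_sq_limits is a made-up helper):

#include <rdma/ib_verbs.h>

static void dump_sq_limits(struct ib_device *dev)
{
	struct ib_device_attr attr;

	if (ib_query_device(dev, &attr))
		return;

	/* Limits most relevant to send-queue and completion-queue sizing. */
	pr_info("%s: max_qp_wr=%d max_cqe=%d max_sge=%d\n",
		dev->name, attr.max_qp_wr, attr.max_cqe, attr.max_sge);
}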

https://bugzilla.linux-nfs.org/show_bug.cgi?id=271
Steve Wise is making progress on this one.

Devesh suggested using the same approach on the client side to reduce 
server-side signaling; he filed a bug to track this and to see whether there is 
any performance difference.
https://bugzilla.linux-nfs.org/show_bug.cgi?id=272

4. Bake-a-thon NFSoRDMA conclave update: Chuck and SteveD gave an update on the 
10/8 Linux Enterprise NFSoRDMA session.
-- The RHEL 7.0 NFSoRDMA server is disabled; we still couldn't locate a 
resource to act as NFSoRDMA server maintainer. The NFSoRDMA client is supported.
-- NFSoRDMA test strategies and utilities: add NFSoRDMA tests
-- NFSoRDMA future directions and features

5. 3.17-rc5 NFSoRDMA performance discussion: (Shirley)
Shirley presented IOZone NFS WRITE performance numbers for NFSoIPoIB and for 
NFSoRDMA in FMR and FRWR modes on ConnectX-2. The discussion focused on NFS 
WRITE. There are a couple of areas that need further research:
-- NFS WRITE overhead: what limits NFS WRITE performance in the NFS protocol?
-- Unexpected latency increase and BW drop for large I/O writes
-- How much gain comes from the IPoIB-CM SG and checksum offloading patch
-- Yan suggested moving the tests to ConnectX-3 since ConnectX-2 is out.

Feel free to reply here for anything missing. See you on Oct.23.

10/9/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Shirley


NFSoRDMA developers bi-weekly meeting (every other Thursday)

2014-09-11 Thread Shirley Ma
NFSoRDMA bi-weekly meetings have been rescheduled from every other Wednesday to 
Thursday, starting today.

Agenda today:

- merge plans for 3.18 and 3.19
- linux-stable tree is not stable: :) what can we do better? (3.17-rc was 
unbootable with IPoIB for over a month)
- performance issues arising
- update from everyone

Thanks
Shirley

Meeting Reminder NFSoRDMA developers bi-weekly meeting 
Date: Thu, Sep 11, 2014
Time: 6:30 AM PST - 7:30 AM PST
Location: Audio Conference
Organizer: Shirley Ma
Attendees: devesh.sha...@emulex.com; anna.schuma...@netapp.com; 
dledf...@redhat.com; Chuck Lever; Shirley Ma; ... 
Description: @7:30am PST @8:30am MST @9:30am CST @10:30am EST @Bangalore 
@8:00pm @Israel @5:30pm Duration: 1 hour Call-in number: Israel: +972 37219638 
Bangalore: +91 8039890080 (180030109800)
France Colombes +33 1 5760  +33 176728936
US: 8666824770, 408-7744073
Conference Code: 2308833 Passcode: 63767362 (it's NFSoRDMA, in case you 
couldn't remember) 




NFSoRDMA developers bi-weekly meeting minutes (9/11)

2014-09-11 Thread Shirley Ma
Attendees:

Yan Burman (Mellanox)
Wendy Cheng (Intel)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)
Dominique Martinet(CEA France)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes:
1. Merge plan for 3.18, 3.19 (Chuck, Steve):
-- Bug fixes from NFSoRDMA bugzilla 
- https://bugzilla.linux-nfs.org/buglist.cgi?quicksearch=rdma
- shutdown issue problem from unmount point (Devesh/Chuck)
-- NFS 4.1 support: 
- Server and client are under testing, Oracle is doing client/server 
interoperability test
- Bi-directional RPCs
- backchannel, second transport TCP

2. Linux development tree unbootable with IPoIB for nearly one month. 
(Chuck, Doug/SteveD, Yan)
-- How to help upstream testing and stability
- Can any vendor fund UNH for zero-day Linux upstream testing?
- Allocate engineering resources to help review code and speed up 
upstream acceptance?
-- Discussed NFSoRDMA's dependency on IPoIB (Wendy)

3. Performance issues arising (SteveW, Shirley)
-- Scalability test for 4 to 16 clients, each client with 4 mount points
-- Single event-queue bottleneck after splitting send/recv completion queues; 
two event queues per QP
-- How to avoid inconsistent performance evaluation: hyper-threads, NUMA, cache 
...
-- RDMA vs. IPoIB, iWARP, RoCE performance tests

4. UNL interoperability bug discussion between PPC (64K page) and X86 (4K page) 
(Rupert, Chuck, SteveW)
-- http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2494
   - cloned https://bugzilla.linux-nfs.org/show_bug.cgi?id=270

Actions:
1. Chuck will talk to Oracle about the possibility of joining OFLG and funding 
UNH for Linux development tree testing
2. Allocate engineering resources to help review upstream code and speed up the 
process: RedHat (Doug), Mellanox (Yan), Oracle (Chuck)

Feel free to reply here for anything missing. See you 9/25.

9/11/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Shirley


Re: [for-next 1/2] xprtrdma: take reference of rdma provider module

2014-07-18 Thread Shirley Ma

On 07/18/2014 06:27 AM, Steve Wise wrote:
 We can't really deal with a CM_DEVICE_REMOVE event while there are
 active NFS mounts.

 System shutdown ordering should guarantee (one would hope) that NFS
 mount points are unmounted before the RDMA/IB core infrastructure is
 torn down. Ordering shouldn't matter as long all NFS activity has
 ceased before the CM tries to remove the device.

 So if something is hanging up the CM, there's something xprtrdma is
 not cleaning up properly.



 Devesh, how are you reproducing this?  Are you just rmmod'ing the ocrdma
 module while there are active mounts?

 Yes, I am issuing rmmod while there is an active mount. In my case rmmod 
 ocrdma remains
 blocked forever.
Where is it blocked?

 Off-the-course of this discussion: Is there a reasoning behind not using
 ib_register_client()/ib_unregister_client() framework?
 
 I think the idea is that you don't need to use it if you are 
 transport-independent and use
 the rdmacm...
 
 
 


Re: [for-next 1/2] xprtrdma: take reference of rdma provider module

2014-07-17 Thread Shirley Ma


On 07/17/2014 12:55 PM, Steve Wise wrote:
 
 
 -Original Message-
 From: Hefty, Sean [mailto:sean.he...@intel.com]
 Sent: Thursday, July 17, 2014 2:50 PM
 To: Steve Wise; 'Shirley Ma'; 'Devesh Sharma'; 'Roland Dreier'
 Cc: linux-rdma@vger.kernel.org; chuck.le...@oracle.com
 Subject: RE: [for-next 1/2] xprtrdma: take reference of rdma provider module

 So the rdma cm is expected to increase the driver reference count
 (try_module_get) for
 each new cm id, then deference count (module_put) when cm id is
 destroyed?


 No, I think he's saying the rdma-cm posts a RDMA_CM_DEVICE_REMOVAL event
 to each
 application with rdmacm objects allocated, and each application is expected
 to destroy all
 the objects it has allocated before returning from the event handler.

 This is almost correct.  The applications do not have to destroy all the 
 objects that it
 has
 allocated before returning from their event handler.  E.g. an app can queue 
 a work item
 that does the destruction.  The rdmacm will block in its ib_client remove 
 handler until
 all
 relevant rdma_cm_id's have been destroyed.

 
 Thanks for the clarification.  

Thanks, checked the cma code, the reference count maintains there.

 And I think the ib_verbs core calls each ib_client's remove handler when an
 rdma provider
 unregisters with the core.

 yes
 


Re: [PATCH v3 00/21] NFS/RDMA client patches for 3.17

2014-07-16 Thread Shirley Ma
These two patches have reduced the interrupt rate significantly, by around 4x.
 
 xprtrdma: Disable completions for FAST_REG_MR Work Requests
 xprtrdma: Disable completions for LOCAL_INV Work Requests 

With the same NFS read/write workload, here is the interrupt rate (irq/sec) 
report based upon /proc/interrupts:

w/o patches:
---
PCI-MSI-edge mlx4-ib (204):  105176
PCI-MSI-edge mlx4-ib (204):  123650
PCI-MSI-edge mlx4-ib (204):  123690
PCI-MSI-edge mlx4-ib (204):  116554
PCI-MSI-edge mlx4-ib (204):  122864

And perf stat irq report: 
Performance counter stats for 'system wide':

 2,131,870  irq:irq_handler_entry   
 [100.00%]
 2,131,870  irq:irq_handler_exit
 [100.00%]
   635,587  irq:softirq_entry   
 [100.00%]
   635,597  irq:softirq_exit
 [100.00%]
   636,155  irq:softirq_raise

  25.422821792 seconds time elapsed

w/i patches:
---
PCI-MSI-edge mlx4-ib (204):  31131
PCI-MSI-edge mlx4-ib (204):  32958
PCI-MSI-edge mlx4-ib (204):  31068
PCI-MSI-edge mlx4-ib (204):  30236
PCI-MSI-edge mlx4-ib (204):  33041

And perf stat irq report:

Performance counter stats for 'system wide':

   653,548  irq:irq_handler_entry   
 [100.00%]
   653,548  irq:irq_handler_exit
 [100.00%]
   568,138  irq:softirq_entry   
 [100.00%]
   568,148  irq:softirq_exit
 [100.00%]
   568,690  irq:softirq_raise   


  21.675597062 seconds time elapsed

Shirley



Re: [PATCH v3 00/21] NFS/RDMA client patches for 3.17

2014-07-16 Thread Shirley Ma


On 07/16/2014 10:57 AM, Chuck Lever wrote:
 
 On Jul 16, 2014, at 11:48 AM, Shirley Ma shirley...@oracle.com wrote:

 These two patches have been significantly reduced interrupt rate by around 4 
 times.

 xprtrdma: Disable completions for FAST_REG_MR Work Requests
 xprtrdma: Disable completions for LOCAL_INV Work Requests 
 
 Thanks Shirley!  This is result applies only to FRMR, correct? Also, i'd 
 imagine the savings would be even greater for adapters that have short page 
 list depth?

Yes, I only tested FRMR with mlx4. I can hack the code to test a short page 
list depth to check the savings. Looking at the difference between irq and 
softirq, it is much closer now.
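One possible shape of that hack, purely as a sketch (the cap and the helper are 
made up; the real xprtrdma sizes its FRMRs from the device attributes): force a 
small fast-register page-list depth when allocating the FRMR resources, so the 
adapter behaves as if it only supported a short page list.

#include <rdma/ib_verbs.h>
#include <linux/err.h>

#define FAKE_MAX_PAGE_LIST_LEN	8	/* pretend the HCA is this limited */

static int alloc_test_frmr(struct ib_pd *pd, struct ib_device *dev,
			   struct ib_mr **mr,
			   struct ib_fast_reg_page_list **pl)
{
	/* 3.x-era fast-registration API */
	*mr = ib_alloc_fast_reg_mr(pd, FAKE_MAX_PAGE_LIST_LEN);
	if (IS_ERR(*mr))
		return PTR_ERR(*mr);

	*pl = ib_alloc_fast_reg_page_list(dev, FAKE_MAX_PAGE_LIST_LEN);
	if (IS_ERR(*pl)) {
		ib_dereg_mr(*mr);
		return PTR_ERR(*pl);
	}
	return 0;
}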

 

 Same NFS read/write workload, here are interrupts rate irq/per sec report 
 based upon /proc/interrupts:

 w/o patches:
 ---
 PCI-MSI-edge mlx4-ib (204):  105176
 PCI-MSI-edge mlx4-ib (204):  123650
 PCI-MSI-edge mlx4-ib (204):  123690
 PCI-MSI-edge mlx4-ib (204):  116554
 PCI-MSI-edge mlx4-ib (204):  122864

 And perf stat irq report: 
 Performance counter stats for 'system wide':

 2,131,870  irq:irq_handler_entry 
[100.00%]
 2,131,870  irq:irq_handler_exit  
[100.00%]
   635,587  irq:softirq_entry 
[100.00%]
   635,597  irq:softirq_exit  
[100.00%]
   636,155  irq:softirq_raise

  25.422821792 seconds time elapsed

 w/i patches:
 ---
 PCI-MSI-edge mlx4-ib (204):  31131
 PCI-MSI-edge mlx4-ib (204):  32958
 PCI-MSI-edge mlx4-ib (204):  31068
 PCI-MSI-edge mlx4-ib (204):  30236
 PCI-MSI-edge mlx4-ib (204):  33041

 And perf stat irq report:

 Performance counter stats for 'system wide':

   653,548  irq:irq_handler_entry 
[100.00%]
   653,548  irq:irq_handler_exit  
[100.00%]
   568,138  irq:softirq_entry 
[100.00%]
   568,148  irq:softirq_exit  
[100.00%]
   568,690  irq:softirq_raise 
   

  21.675597062 seconds time elapsed

 Shirley



Re: [PATCH v2 15/21] xprtrdma: Disable completions for LOCAL_INV Work Requests

2014-07-10 Thread Shirley Ma
That would significantly reduce the interrupt rate. I will run some 
performance tests.

Could these unsignaled SENDs possibly stall?
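For context, the usual worry with fully unsignaled sends is that the provider 
can only retire send WRs when some later signaled completion is reaped, so the 
SQ can fill up if nothing is ever signaled. A common mitigation, sketched below 
with a made-up interval (this is not the xprtrdma scheme, which budgets 
signaling with its CQCOUNT logic), is to signal every Nth send:

#include <rdma/ib_verbs.h>

#define SIGNAL_EVERY	32	/* hypothetical; keep it well below the SQ depth */

static int post_mostly_unsignaled(struct ib_qp *qp, struct ib_send_wr *wr,
				  unsigned int *send_count)
{
	struct ib_send_wr *bad_wr;

	/* Signal one WR out of every SIGNAL_EVERY so older unsignaled WRs
	 * are retired and the send queue never fills with unreaped work. */
	if (++(*send_count) % SIGNAL_EVERY == 0)
		wr->send_flags |= IB_SEND_SIGNALED;
	else
		wr->send_flags &= ~IB_SEND_SIGNALED;

	return ib_post_send(qp, wr, &bad_wr);
}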

Shirley

On 07/09/2014 09:58 AM, Chuck Lever wrote:
 Instead of relying on a completion to change the state of an FRMR
 to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion
 will fire anyway and mark the FRMR FRMR_IS_STALE.
 
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
  net/sunrpc/xprtrdma/verbs.c |   17 -
  1 file changed, 8 insertions(+), 9 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
 index e83eda6..8eef343 100644
 --- a/net/sunrpc/xprtrdma/verbs.c
 +++ b/net/sunrpc/xprtrdma/verbs.c
 @@ -154,12 +154,8 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
  
  	if (wc->wr_id == 0ULL)
  		return;
 -	if (wc->status != IB_WC_SUCCESS) {
 +	if (wc->status != IB_WC_SUCCESS)
  		frmr->r.frmr.fr_state = FRMR_IS_STALE;
 -		return;
 -	}
 -
 -	frmr->r.frmr.fr_state = FRMR_IS_INVALID;
  }
  
  static int
 @@ -1369,12 +1365,11 @@ rpcrdma_retry_local_inv(struct rpcrdma_mw *r, struct rpcrdma_ia *ia)
  	dprintk("RPC:       %s: FRMR %p is stale\n", __func__, r);
  
  	/* When this FRMR is re-inserted into rb_mws, it is no longer stale */
 -	r->r.frmr.fr_state = FRMR_IS_VALID;
 +	r->r.frmr.fr_state = FRMR_IS_INVALID;
  
  	memset(&invalidate_wr, 0, sizeof(invalidate_wr));
  	invalidate_wr.wr_id = (unsigned long)(void *)r;
  	invalidate_wr.opcode = IB_WR_LOCAL_INV;
 -	invalidate_wr.send_flags = IB_SEND_SIGNALED;
  	invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
  	DECR_CQCOUNT(&r_xprt->rx_ep);
  
 @@ -1787,10 +1782,11 @@ rpcrdma_deregister_frmr_external(struct rpcrdma_mr_seg *seg,
  	struct ib_send_wr invalidate_wr, *bad_wr;
  	int rc;
  
 +	seg1->mr_chunk.rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
 +
  	memset(&invalidate_wr, 0, sizeof invalidate_wr);
  	invalidate_wr.wr_id = (unsigned long)(void *)seg1->mr_chunk.rl_mw;
  	invalidate_wr.opcode = IB_WR_LOCAL_INV;
 -	invalidate_wr.send_flags = IB_SEND_SIGNALED;
  	invalidate_wr.ex.invalidate_rkey = seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey;
  	DECR_CQCOUNT(&r_xprt->rx_ep);
  
 @@ -1799,9 +1795,12 @@ rpcrdma_deregister_frmr_external(struct rpcrdma_mr_seg *seg,
  	rpcrdma_unmap_one(ia, seg++);
  	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
  	read_unlock(&ia->ri_qplock);
 -	if (rc)
 +	if (rc) {
 +		/* Force rpcrdma_buffer_get() to retry */
 +		seg1->mr_chunk.rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
  		dprintk("RPC:       %s: failed ib_post_send for invalidate,"
  			" status %i\n", __func__, rc);
 +	}
  	return rc;
  }
  
 


Re: [PATCH v2 05/21] xprtrdma: On disconnect, don't ignore pending CQEs

2014-07-09 Thread Shirley Ma
Should all rdma_clean_cq be replaced by flush_cqs? The outstanding CQEs should 
be processed in any context. 

Shirley

On 07/09/2014 09:57 AM, Chuck Lever wrote:
 xprtrdma is currently throwing away queued completions during
 a reconnect. RPC replies posted just before connection loss, or
 successful completions that change the state of an FRMR, can be
 missed.
 
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
  net/sunrpc/xprtrdma/verbs.c |   14 +-
  1 file changed, 9 insertions(+), 5 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
 index 0d5187d..7fd457e 100644
 --- a/net/sunrpc/xprtrdma/verbs.c
 +++ b/net/sunrpc/xprtrdma/verbs.c
 @@ -310,6 +310,13 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
   rpcrdma_recvcq_poll(cq, ep);
  }
  
 +static void
 +rpcrdma_flush_cqs(struct rpcrdma_ep *ep)
 +{
 +	rpcrdma_recvcq_upcall(ep->rep_attr.recv_cq, ep);
 +	rpcrdma_sendcq_upcall(ep->rep_attr.send_cq, ep);
 +}
 +
  #ifdef RPC_DEBUG
  static const char * const conn[] = {
  	"address resolved",
 @@ -872,9 +879,7 @@ retry:
  	if (rc && rc != -ENOTCONN)
  		dprintk("RPC:       %s: rpcrdma_ep_disconnect"
  			" status %i\n", __func__, rc);
 -
 -	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
 -	rpcrdma_clean_cq(ep->rep_attr.send_cq);
 +	rpcrdma_flush_cqs(ep);
  
  	xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
  	id = rpcrdma_create_id(xprt, ia,
 @@ -985,8 +990,7 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
  {
  	int rc;
  
 -	rpcrdma_clean_cq(ep->rep_attr.recv_cq);
 -	rpcrdma_clean_cq(ep->rep_attr.send_cq);
 +	rpcrdma_flush_cqs(ep);
  	rc = rdma_disconnect(ia->ri_id);
  	if (!rc) {
  		/* returns without wait if not connected */
 


No NFSoRDMA bi-weekly meeting from now to Aug.

2014-07-08 Thread Shirley Ma
Most of us will be on vacation in the summer so there will be no bi-weekly 
meetings from now to Aug. We will resume the meeting in Sept.

Let me know if you would like to have some discussion occasionally. I will be 
around.

Enjoy your summer!

Shirley


NFSoRDMA developers bi-weekly meeting minutes (6/25)

2014-07-02 Thread Shirley Ma
Sorry for the delay; I just found that this email hadn't gone out. Please add 
anything that's missing.

Attendees:

Rupert Dance (Soft Forge)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)
Steve Dickson (RedHat)

Jeff is busy on OFED stack work, others are on vacation

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting is meant to help organize NFSoRDMA 
development and test efforts from different resources, to speed up NFSoRDMA 
upstream kernel work and NFSoRDMA diagnosing/debugging tools development. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.

Today's meeting notes:
1. Follow ups from last week:
Steve has created a test git tree for the client and server patches that are 
heading upstream, so interested parties can test them out. Here is the link to 
the git tree for testing:

The branch is named for-test and the repo is at: 
git://git.linux-nfs.org/projects/swise/linux.git

2. Rupert gave an update on the NFSoRDMA patches to be included in the coming 
OFED-3.12-1 release, which is targeted for July. Jeff has ported all the 
patches from the upstream tree, and UNL is testing them for regressions.

3. Chuck created a destructive test patch to inject IB verbs errors randomly 
into NFSoRDMA. This can easily trigger NFSoRDMA bugs. The approach can simulate 
cable plug/unplug and rpcrdma connect/reconnect stale-resource issues. This 
patch will benefit developers for testing and debugging. Steve mentioned that 
Chelsio has a similar test method in userspace.

4. Discussed Chuck's most recent xprtrdma patchset for upstream; decided that 
the reference count on the memory window buffer list patch is not needed.

5. Shirley discussed NFSoRDMA performance analysis tools to measure CPU, 
bandwidth, latency, locks, interrupt rate, and memory footprint. The suggestion 
is to use the existing lock_stat and perf to report/monitor CPU and lock 
statistics, and to use nfsstat to report NFS statistics first; Shirley 
suggested adding more statistics to the RDMA operations for performance 
analysis. Shirley will write some documentation on the performance analysis 
tools.

6. NFSoRDMA virtualization doesn't work. Shirley found that VF FRMR reported a 
port error after a successful post_send on ConnectX-2. It didn't seem to be an 
NFSoRDMA issue. The question is whether VF FRMR is fully tested. Later, Yan 
confirmed that VF FRMR doesn't work for NFSoRDMA on ConnectX-3 either.

7. Mike reported that NFSoRDMA doesn't show the same performance as SMB for a 
fio test. Shirley was able to reproduce it; Steve will ask a Chelsio engineer 
to test it as well.

Meeting time: one hour discussion every other Wed (next meeting will be
on 7/6). A reminder will be sent out to both linux-nfs and linux-rdma
mailing list:

6/25/2014
@8:00am PST (summer time)
@9:00am MST (summer time)
@10:00am CST (summer time)
@11:00am EST (summer time)
@Bangalore @8:00pm
@Israel @6:00pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
France  Colombes +33 1 5760 +33 176728936
Bangalore: +91 8039890080 (180030109800)
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.


Re: NFSoRDMA developers bi-weekly meeting minutes (6/25)

2014-07-02 Thread Shirley Ma
Yes, thanks for the correction.

Shirley

On 07/02/2014 12:18 PM, Chuck Lever wrote:
 
 On Jul 2, 2014, at 12:11 PM, Shirley Ma shirley...@oracle.com wrote:
 
 7. Mike reported that NFSoRDMA doesn't show same performance as SMB for fio 
 test.
 
 Shirley, did you mean Mark Lehrer ?
 
 http://marc.info/?l=linux-rdmam=140260285527831w=2
 
 Shirley is able to reproduce it, Steve will ask Chelsio engineer to test it 
 as well.
 


Re: [PATCH v1 00/13] NFS/RDMA patches for 3.17

2014-06-27 Thread Shirley Ma
Passed cthon04 and iozone interoperability tests between the Linux client and 
the Solaris server; no regressions.

Shirley

On 06/25/2014 03:47 PM, Steve Wise wrote:
 Hey Chuck, 
 
 I did some testing on this series.  Just 2 nodes, both nfs3 and nfs4 over 
 cxgb4 and mlx4:
 
 cthon b/g/s 10 iterations
 iozone -a with direct IO and data validation
 fio write and rand-rw testing of large IO/files and 8 threads.
 xfs test suite.
 
 No regressions seen.
 
 Tested-by: Steve Wise sw...@opengridcomputing.com
 
 
 -Original Message-
 From: linux-nfs-ow...@vger.kernel.org 
 [mailto:linux-nfs-ow...@vger.kernel.org] On Behalf
 Of Chuck Lever
 Sent: Monday, June 23, 2014 5:39 PM
 To: linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
 Subject: [PATCH v1 00/13] NFS/RDMA patches for 3.17

 The main purpose of this series is to address more connection drop
 recovery issues by fixing FRMR re-use to make it less likely the
 client will drop the connection due to a memory operation error.

 Some other clean-ups and fixes are present as well.

 See topic branch nfs-rdma-for-3.17 in

   git://git.linux-nfs.org/projects/cel/cel-2.6.git

 I tested with NFSv3 and NFSv4 on all three supported memory
 registration modes. Used cthon04 and iozone with both Solaris
 and Linux NFS/RDMA servers. Used xfstests with Linux.

 ---

 Chuck Lever (13):
   xprtrdma: Fix panic in rpcrdma_register_frmr_external()
  xprtrdma: Protect ->qp during FRMR deregistration
   xprtrdma: Limit data payload size for ALLPHYSICAL
   xprtrdma: Update rkeys after transport reconnect
   xprtrdma: Don't drain CQs on transport disconnect
   xprtrdma: Unclutter struct rpcrdma_mr_seg
  xprtrdma: Encode Work Request opcode in wc->wr_id
   xprtrdma: Back off rkey when FAST_REG_MR fails
   xprtrdma: Refactor rpcrdma_buffer_put()
   xprtrdma: Release FRMR segment buffers during LOCAL_INV completion
   xprtrdma: Clean up rpcrdma_ep_disconnect()
   xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
   xprtrdma: Handle additional connection events


  include/linux/sunrpc/xprtrdma.h |2
  net/sunrpc/xprtrdma/rpc_rdma.c  |   77 +
  net/sunrpc/xprtrdma/transport.c |   17 +-
  net/sunrpc/xprtrdma/verbs.c |  330 
 +++
  net/sunrpc/xprtrdma/xprt_rdma.h |   63 ++-
  5 files changed, 332 insertions(+), 157 deletions(-)

 --
 Chuck Lever


Re: help with IB_WC_MW_BIND_ERR

2014-06-26 Thread Shirley Ma
Hello Eli, Or,

Do you know who can help on this? NFSoRDMA hits this error case with Mellanox 
ConnectX-2 HCAs.

Thanks
Shirley

On 05/20/2014 11:55 AM, Chuck Lever wrote:
 Hi-
 
 What does it mean when a LOCAL_INV work request fails with a
 IB_WC_MW_BIND_ERR completion?
 
 --
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 


NFSoRDMA bi-weekly meeting reminder

2014-06-25 Thread Shirley Ma
6/25/2014
@8:00am PST summer time
@9:00am MST summer time
@10:00am CST summer time
@11:00am EST summer time
@Bangalore @8:00pm
@Israel @6:00pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France  Colombes +33 1 5760 +33 176728936
US: 8666824770,  408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks
Shirley
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 10/13] xprtrdma: Release FRMR segment buffers during LOCAL_INV completion

2014-06-25 Thread Shirley Ma

On 06/25/2014 07:32 AM, Chuck Lever wrote:
 Hi Shirley-
 
 On Jun 25, 2014, at 1:17 AM, Shirley Ma shirley...@oracle.com wrote:
 
 Would it be possible to delay rpcrdma_buffer_put() until the LOCAL_INV request's 
 send completion? Remove rpcrdma_buffer_put() from xprt_rdma_free() and add a 
 callback after LOCAL_INV completes?
 
 That’s exactly what this patch does. The relevant part of
 rpcrdma_buffer_put() is:
 
   list_add(&mw->mw_list, &buf->rb_mws);
 
 This is now wrapped with a reference count so that
 rpcrdma_buffer_put() and the LOCAL_INV completion can run in any
 order. The FRMR is added back to the list only after both of those
 two have finished.

What I was thinking was to run rpcrdma_buffer_put() after the LOCAL_INV completion, 
without a reference count.
 
 Nothing in xprt_rdma_free() is allowed to sleep, so we can’t wait for
 LOCAL_INV completion in there.
 
 The only alternative I can think of is having rpcrdma_buffer_get() check
 fr_state as it removes FRMRs from the rb_mws list. Only if the FRMR is
 marked FRMR_IS_INVALID, rpcrdma_buffer_get() will add it to the
 rpcrdma_req.

I thought about that too; an atomic operation would be better than a lock.
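
A minimal sketch of the reference-count idea, for illustration only (this is not 
the actual xprtrdma code; struct frmr_ref and the frmr_* helpers are invented 
names for the example):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/list.h>

/* Sketch: the kref lets xprt_rdma_free() and the LOCAL_INV completion
 * run in either order; the FRMR goes back on the free list only after
 * both sides have dropped their reference.
 */
struct frmr_ref {
	struct kref		ref;
	struct list_head	list;
	struct list_head	*pool;	/* free list, protected by the pool lock */
};

static void frmr_release(struct kref *kref)
{
	struct frmr_ref *f = container_of(kref, struct frmr_ref, ref);

	list_add_tail(&f->list, f->pool);	/* now safe to hand out again */
}

/* the owner holds one reference from kref_init() when the FRMR is taken
 * off the pool; an extra reference is taken for the posted Work Request */
static void frmr_get(struct frmr_ref *f)
{
	kref_get(&f->ref);
}

/* called from both xprt_rdma_free() and the LOCAL_INV send completion */
static void frmr_put(struct frmr_ref *f)
{
	kref_put(&f->ref, frmr_release);
}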

 --
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 10/13] xprtrdma: Release FRMR segment buffers during LOCAL_INV completion

2014-06-24 Thread Shirley Ma
Would it be possible to delay rpcrdma_buffer_put() until the LOCAL_INV request's send 
completion? Remove rpcrdma_buffer_put() from xprt_rdma_free() and add a callback 
after LOCAL_INV completes?

Shirley

On 06/23/2014 03:40 PM, Chuck Lever wrote:
 FRMR uses a LOCAL_INV Work Request, which is asynchronous, to
 deregister segment buffers.  Other registration strategies use
 synchronous deregistration mechanisms (like ib_unmap_fmr()).
 
 For a synchronous deregistration mechanism, it makes sense for
 xprt_rdma_free() to put segment buffers back into the buffer pool
 immediately once rpcrdma_deregister_external() returns.
 
 This is currently also what FRMR is doing. It is releasing segment
 buffers just after the LOCAL_INV WR is posted.
 
 But segment buffers need to be put back after the LOCAL_INV WR
 _completes_ (or flushes). Otherwise, rpcrdma_buffer_get() can then
 assign these segment buffers to another RPC task while they are
 still in use by the hardware.
 
 The result of re-using an FRMR too quickly is that its rkey
 no longer matches the rkey that was registered with the provider.
 This results in FAST_REG_MR or LOCAL_INV Work Requests completing
 with IB_WC_MW_BIND_ERR, and the FRMR, and thus the transport,
 becomes unusable.
 
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
  net/sunrpc/xprtrdma/verbs.c     |   44 +++
  net/sunrpc/xprtrdma/xprt_rdma.h |    2 ++
  2 files changed, 42 insertions(+), 4 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
 index f24f0bf..52f57f7 100644
 --- a/net/sunrpc/xprtrdma/verbs.c
 +++ b/net/sunrpc/xprtrdma/verbs.c
 @@ -62,6 +62,8 @@
  #endif
  
  static void rpcrdma_decrement_frmr_rkey(struct rpcrdma_mw *);
 +static void rpcrdma_get_mw(struct rpcrdma_mw *);
 +static void rpcrdma_put_mw(struct rpcrdma_mw *);
  
  /*
   * internal functions
 @@ -167,6 +169,7 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
  		if (fastreg)
  			rpcrdma_decrement_frmr_rkey(mw);
  	}
 +	rpcrdma_put_mw(mw);
  }
  
  static int
 @@ -1034,7 +1037,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
  	len += cdata->padding;
  	switch (ia->ri_memreg_strategy) {
  	case RPCRDMA_FRMR:
 -		len += buf->rb_max_requests * RPCRDMA_MAX_SEGS *
 +		len += (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS *
  				sizeof(struct rpcrdma_mw);
  		break;
  	case RPCRDMA_MTHCAFMR:
 @@ -1076,7 +1079,7 @@ rpcrdma_buffer_create(struct rpcrdma_buffer *buf, struct rpcrdma_ep *ep,
  	r = (struct rpcrdma_mw *)p;
  	switch (ia->ri_memreg_strategy) {
  	case RPCRDMA_FRMR:
 -		for (i = buf->rb_max_requests * RPCRDMA_MAX_SEGS; i; i--) {
 +		for (i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS; i; i--) {
  			r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(ia->ri_pd,
  						ia->ri_max_frmr_depth);
  			if (IS_ERR(r->r.frmr.fr_mr)) {
 @@ -1252,12 +1255,36 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
  }
  
  static void
 -rpcrdma_put_mw_locked(struct rpcrdma_mw *mw)
 +rpcrdma_free_mw(struct kref *kref)
  {
 +	struct rpcrdma_mw *mw = container_of(kref, struct rpcrdma_mw, mw_ref);
  	list_add_tail(&mw->mw_list, &mw->mw_pool->rb_mws);
  }
  
  static void
 +rpcrdma_put_mw_locked(struct rpcrdma_mw *mw)
 +{
 +	kref_put(&mw->mw_ref, rpcrdma_free_mw);
 +}
 +
 +static void
 +rpcrdma_get_mw(struct rpcrdma_mw *mw)
 +{
 +	kref_get(&mw->mw_ref);
 +}
 +
 +static void
 +rpcrdma_put_mw(struct rpcrdma_mw *mw)
 +{
 +	struct rpcrdma_buffer *buffers = mw->mw_pool;
 +	unsigned long flags;
 +
 +	spin_lock_irqsave(&buffers->rb_lock, flags);
 +	rpcrdma_put_mw_locked(mw);
 +	spin_unlock_irqrestore(&buffers->rb_lock, flags);
 +}
 +
 +static void
  rpcrdma_buffer_put_mw(struct rpcrdma_mw **mw)
  {
  	rpcrdma_put_mw_locked(*mw);
 @@ -1304,6 +1331,7 @@ rpcrdma_buffer_get_mws(struct rpcrdma_req *req, struct rpcrdma_buffer *buffers)
  		r = list_entry(buffers->rb_mws.next,
  				struct rpcrdma_mw, mw_list);
  		list_del(&r->mw_list);
 +		kref_init(&r->mw_ref);
  		r->mw_pool = buffers;
  		req->rl_segments[i].mr_chunk.rl_mw = r;
  	}
 @@ -1583,6 +1611,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
  	dprintk("RPC:       %s: Using frmr %p to map %d segments\n",
  		__func__, seg1->mr_chunk.rl_mw, i);
  
 +	rpcrdma_get_mw(seg1->mr_chunk.rl_mw);
  	if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.fr_state == FRMR_IS_VALID)) {
  		dprintk("RPC:       %s: frmr %x left valid, posting invalidate.\n",
  			__func__,
 @@ -1595,6 +1624,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
  		invalidate_wr.send_flags = IB_SEND_SIGNALED;
   

Re: rmmod mlx4_core panic 3.16-rc1

2014-06-20 Thread Shirley Ma


On 06/19/2014 08:34 PM, Or Gerlitz wrote:
 On Thu, Jun 19, 2014 at 6:33 AM, Shirley Ma shirley...@oracle.com wrote:

 1. Are IB VFs supported on ConnectX-2 (mlx4 driver)?

 I tried num_vfs={port1,port2,port1+2} when loading the mlx4_core module; it 
 failed with mlx4_core :40:00.0: Invalid syntax of num_vfs/probe_vfs with 
 IB port - single port VFs syntax is only supported when all ports are 
 configured as ethernet
 
 
 What do you mean by port1 and port2 -- can you give the exact
 command line you used?
 
 Single ported VFs are currently supported for Ethernet only
 configuration, that is not for only IB nor for VPI, that is only if
 you use port_type_arrary=2,2
 

I tried the command line with num_vfs but without port_type_array=2,2.

num_vfs=2 
num_vfs={1,1,2}

both failed.

 


 2. After the mlx4_core module is loaded with num_vfs={} parameters, 
 removing mlx4_core consistently hits the panic below. Is this 
 problem being tracked?
 
 
 what do you mean by  num_vfs={}, is it num_vfs=N or {N}, also here,
 please send the exact setting you used. The crash you indicated below
 is supposed to be fixed by the upstream  commit
 da1de8dfff09d33d4a5345762c21b487028e25f5 net/mlx4_core: Keep only one
 driver entry release - are you sure you have this commit in the tree
 you are working with?
 
 Or.

Yes, I tried the net-next tree with this commit, 
da1de8dfff09d33d4a5345762c21b487028e25f5.


 mlx4_ib mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 
 (Feb 2014)
 mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
 mlx4_core: Initializing :40:00.0
 mlx4_core :40:00.0: Enabling SR-IOV with 2 VFs
 pci :40:00.1: [15b3:1002] type 00 class 0x0c0600
 mlx4_core: Initializing :40:00.1
 mlx4_core :40:00.1: enabling device ( - 0002)
 mlx4_core :40:00.1: Skipping virtual function:1
 pci :40:00.2: [15b3:1002] type 00 class 0x0c0600
 mlx4_core: Initializing :40:00.2
 mlx4_core :40:00.2: enabling device ( - 0002)
 mlx4_core :40:00.2: Skipping virtual function:2
 mlx4_core :40:00.0: Running in master mode
 mlx4_core :40:00.0: PCIe BW is different than device's capability
 mlx4_core :40:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s
 mlx4_core :40:00.0: PCIe link width is x8, device supports x8
 mlx4_core :40:00.0: Invalid syntax of num_vfs/probe_vfs with IB port - 
 single port VFs syntax is only supported when all ports are configured as 
 ethernet
 BUG: unable to handle kernel NULL pointer dereference at 038c
 IP: [a0350450] __mlx4_remove_one+0x20/0x380 [mlx4_core]
 PGD 45d3ba067 PUD 45ace8067 PMD 0
 Oops:  [#1] SMP DEBUG_PAGEALLOC
 Modules linked in: mlx4_core(-) ebtable_nat ebtables ipt_MASQUERADE 
 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc 
 autofs4 cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 
 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 
 xt_state nf_conntrack ip6table_filter ip6_tables dm_mirror dm_region_hash 
 dm_log dm_mod vhost_net macvtap macvlan vhost tun kvm_intel kvm iTCO_wdt 
 iTCO_vendor_support microcode ipmi_si ipmi_msghandler acpi_cpufreq pcspkr 
 i2c_i801 i2c_core lpc_ich mfd_core shpchp sg ioatdma ib_sa ib_mad ib_core 
 ib_addr ipv6 vxlan ixgbe dca ptp pps_core hwmon mdio ext3 jbd mbcache sd_mod 
 crc_t10dif crct10dif_common usb_storage ahci libahci mpt2sas 
 scsi_transport_sas raid_class [last unloaded: mlx4_core]
 CPU: 13 PID: 7212 Comm: rmmod Not tainted 3.16.0-rc1+ #1
 Hardware name: Oracle Corporation SUN FIRE X4170 M3 /ASSY,MOTHERBOARD,1U 
   , BIOS 17050100 08/29/2013
 task: 880461540110 ti: 88046500 task.ti: 88046500
 RIP: 0010:[a0350450]  [a0350450] 
 __mlx4_remove_one+0x20/0x380 [mlx4_core]
 RSP: 0018:880465003d88  EFLAGS: 00010296
 RAX: 0001 RBX:  RCX: 
 RDX: 0026 RSI: 0292 RDI: 880468b8f000
 RBP: 880465003db8 R08:  R09: 
 R10: 09f911029d74e35b R11: 09f911029d74e35b R12: 
 R13: 880468b8f000 R14: a036de40 R15: 0001
 FS:  7ff287fc2700() GS:88046fce() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 038c CR3: 00045cfae000 CR4: 000407e0
 Stack:
  880465003da8 880468b8f000  880468b8f000
  a036de40 0001 880465003dd8 a0350805
  880468b8f098 a036dd60 880465003e08 812ebaa6
 Call Trace:
  [a0350805] mlx4_remove_one+0x25/0x50 [mlx4_core]
  [812ebaa6] pci_device_remove+0x46/0xc0
  [813ce08f] __device_release_driver+0x7f/0xf0
  [813ce1c8] driver_detach+0xc8/0xd0
  [813cced9] bus_remove_driver+0x59/0xd0
  [813cef80] driver_unregister+0x30/0x70
  [812ebc13] pci_unregister_driver+0x23/0x80

Re: rmmod mlx4_core panic 3.16-rc1

2014-06-20 Thread Shirley Ma


On 06/19/2014 11:17 PM, Or Gerlitz wrote:
 On Fri, Jun 20, 2014 at 6:34 AM, Or Gerlitz or.gerl...@gmail.com wrote:
 On Thu, Jun 19, 2014 at 6:33 AM, Shirley Ma shirley...@oracle.com wrote:
 1. Whether IB VFs is supported in ConnectX-2 (mlx4 driver)?
 I tried to num_vfs={port1,port2,port1+2} when loading mlx4_core module, it 
 failed with mlx4_core :40:00.0: Invalid syntax of num_vfs/probe_vfs 
 with IB port - single port VFs syntax is only supported when all ports are 
 configured as ethernet
 
 What do you mean by port1 and port2 -- can you give the exact
 command line you used?
 
 Single-ported VFs are currently supported only for an Ethernet-only
 configuration, not for IB nor for VPI; that is, only if
 you use port_type_array=2,2
 
 Note that you can still use dual-ported VFs, for both IB, Ethernet and
 VPI, that is
 num_vfs=N will create N dual-ported VFs, are you on IB?

Yes, I tried num_vfs=N. It failed. I didn't combine it with port_type_array=2,2. 
If that's required, then the code needs to check the arguments.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rmmod mlx4_core panic 3.16-rc1

2014-06-20 Thread Shirley Ma
When loading the module, arguments that depend on one another need to be checked 
against each other. You can easily reproduce this problem by using only num_vfs=N.
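
As a minimal sketch of that kind of cross-check, for illustration only (this is 
not the actual mlx4_core code; the module below is a simplified stand-alone 
example that only borrows the num_vfs/port_type_array parameter names):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Example: reject the per-port num_vfs syntax at module load time
 * unless every port is configured as Ethernet, instead of failing
 * later in the probe/remove path.
 */
static int num_vfs[3];
static int num_vfs_argc;
module_param_array(num_vfs, int, &num_vfs_argc, 0444);

static int port_type_array[2] = { 1, 1 };	/* 1 = IB, 2 = Ethernet */
static int port_type_argc;
module_param_array(port_type_array, int, &port_type_argc, 0444);

static bool all_ports_eth(void)
{
	return port_type_array[0] == 2 && port_type_array[1] == 2;
}

static int __init example_init(void)
{
	/* Per-port VF syntax (more than one num_vfs value) requires
	 * all ports to be Ethernet; fail cleanly up front. */
	if (num_vfs_argc > 1 && !all_ports_eth()) {
		pr_err("num_vfs per-port syntax requires port_type_array=2,2\n");
		return -EINVAL;
	}
	return 0;
}

static void __exit example_exit(void)
{
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");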

Shirley

On 06/19/2014 11:34 PM, Or Gerlitz wrote:
 On Fri, Jun 20, 2014 at 9:33 AM, Or Gerlitz or.gerl...@gmail.com wrote:
 On Fri, Jun 20, 2014 at 9:21 AM, Wei Yang weiy...@linux.vnet.ibm.com wrote:

 From this log, it happens during probe?
 If not, any action after probe?

 yep, maybe the bug still exists in the error flow of probe? you can probe 
 with
 num_vfs=1,1,1 port_type_array=1,1 and see if you hit it

 I tried this
 modprobe mlx4_core num_vfs=3 probe_vf=3 port_type_array=1,1
 It looks good to me.

 NO. I wanted you to hit the error flow where this bug seems to
 remain... so you need to try and use single ported VFs with IB e.g
 $ modprobe mlx4_core num_vfs=1,1,1 port_type_array=1,1
 
 or
 
 $ modprobe mlx4_core num_vfs=1,1,1 probe_vf=1,1,1 port_type_array=1,1
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to re-use a QP for a new connection

2014-06-20 Thread Shirley Ma
The QP can be reused. The rdma_id_private has a field reuseaddr. What 
additional change is needed besides rdma_set_reuseaddr?

Shirley

On 06/20/2014 02:17 PM, Hefty, Sean wrote:
 During a remote transport disconnect, the QP leaves RTS.

 xprtrdma deals with this in a separate transport connect worker process,
 where it creates a new id and qp, and replaces the existing id and qp.

 Unfortunately there are parts of xprtrdma (namely FRMR deregistration)
 that are not easy to serialize with this reconnect logic.

 Re-using the QP would mean no serialization would be needed between
 transport reconnect and FRMR deregistration.

 If QP re-use is not supported, though, it's not worth considering any
 further.
 
 It may be possible to reuse the QP, just not the rdma_cm_id without 
 additional code changes.  Reuse of the rdma_cm_id may also require changes in 
 the underlying IB/iWarp CMs.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to re-use a QP for a new connection

2014-06-20 Thread Shirley Ma
On 06/20/2014 03:30 PM, Chuck Lever wrote:
 Hi Shirley-
 
 I’ve found that to move the QP back to the IB_QPS_INIT state, I need to
 call ib_modify_qp() with a specific set of attributes, including the
 pkey_index and port_num.
 
 rdma_init_qp_attr() extracts those attributes. But, when I try to call it
 after rdma_disconnect(), the rdma_cm_id is not in the RDMA_CM_IDLE state,
 and the call fails.
 
 So I can’t get the QP back to the INIT state unless the rdma_cm_id has
 somehow been reset.

I see, we need to have rdma_reset_id() to change the cm_id state to 
RDMA_CM_IDLE.
 
 I suppose I could call rdma_init_qp_attr() while the transport is still
 connected, and save the returned attributes.
Maybe we can save ib_qp_attr in xprtrdma rpcrdma_ia?
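
As a rough sketch of that idea, for illustration only (this is not the xprtrdma 
implementation; the helper names are invented for the example and error handling 
is minimal): capture the INIT-transition attributes while the rdma_cm_id is still 
connected, then replay them after a disconnect to walk the same QP back through 
RESET to INIT.

#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

static int save_init_attr(struct rdma_cm_id *id,
			  struct ib_qp_attr *saved, int *saved_mask)
{
	saved->qp_state = IB_QPS_INIT;
	/* rdma_init_qp_attr() fills in pkey_index, port_num, etc., but
	 * only succeeds while the cm_id is in a suitable (connected)
	 * state, so this has to be done before the disconnect. */
	return rdma_init_qp_attr(id, saved, saved_mask);
}

static int reset_qp_to_init(struct ib_qp *qp,
			    struct ib_qp_attr *saved, int saved_mask)
{
	struct ib_qp_attr reset = { .qp_state = IB_QPS_RESET };
	int rc;

	rc = ib_modify_qp(qp, &reset, IB_QP_STATE);
	if (rc)
		return rc;
	return ib_modify_qp(qp, saved, saved_mask);	/* RESET -> INIT */
}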

 On Jun 20, 2014, at 6:24 PM, Shirley Ma shirley...@oracle.com wrote:
 
 The QP can be reused. The rdma_id_private has a field reuseaddr. What 
 additional change is needed besides rdma_set_reuseaddr?

 Shirley

 On 06/20/2014 02:17 PM, Hefty, Sean wrote:
 During a remote transport disconnect, the QP leaves RTS.

 xprtrdma deals with this in a separate transport connect worker process,
 where it creates a new id and qp, and replaces the existing id and qp.

 Unfortunately there are parts of xprtrdma (namely FRMR deregistration)
 that are not easy to serialize with this reconnect logic.

 Re-using the QP would mean no serialization would be needed between
 transport reconnect and FRMR deregistration.

 If QP re-use is not supported, though, it's not worth considering any
 further.

 It may be possible to reuse the QP, just not the rdma_cm_id without 
 additional code changes.  Reuse of the rdma_cm_id may also require changes 
 in the underlying IB/iWarp CMs.
 
 --
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


rmmod mlx4_core panic 3.16-rc1

2014-06-18 Thread Shirley Ma
Hello Or,

Two questions here:
1. Are IB VFs supported on ConnectX-2 (mlx4 driver)?

I tried num_vfs={port1,port2,port1+2} when loading the mlx4_core module; it 
failed with
mlx4_core :40:00.0: Invalid syntax of num_vfs/probe_vfs with IB port - 
single port VFs syntax is only supported when all ports are configured as 
ethernet

2. After the mlx4_core module is loaded with num_vfs={} parameters, removing 
mlx4_core consistently hits the panic below. Is this problem being tracked?

Thanks
Shirley


mlx4_ib mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 
2014)
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing :40:00.0
mlx4_core :40:00.0: Enabling SR-IOV with 2 VFs
pci :40:00.1: [15b3:1002] type 00 class 0x0c0600
mlx4_core: Initializing :40:00.1
mlx4_core :40:00.1: enabling device ( - 0002)
mlx4_core :40:00.1: Skipping virtual function:1
pci :40:00.2: [15b3:1002] type 00 class 0x0c0600
mlx4_core: Initializing :40:00.2
mlx4_core :40:00.2: enabling device ( - 0002)
mlx4_core :40:00.2: Skipping virtual function:2
mlx4_core :40:00.0: Running in master mode
mlx4_core :40:00.0: PCIe BW is different than device's capability
mlx4_core :40:00.0: PCIe link speed is 5.0GT/s, device supports 8.0GT/s
mlx4_core :40:00.0: PCIe link width is x8, device supports x8
mlx4_core :40:00.0: Invalid syntax of num_vfs/probe_vfs with IB port - 
single port VFs syntax is only supported when all ports are configured as 
ethernet
BUG: unable to handle kernel NULL pointer dereference at 038c
IP: [a0350450] __mlx4_remove_one+0x20/0x380 [mlx4_core]
PGD 45d3ba067 PUD 45ace8067 PMD 0 
Oops:  [#1] SMP DEBUG_PAGEALLOC
Modules linked in: mlx4_core(-) ebtable_nat ebtables ipt_MASQUERADE iptable_nat 
nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 
cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter 
ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack 
ip6table_filter ip6_tables dm_mirror dm_region_hash dm_log dm_mod vhost_net 
macvtap macvlan vhost tun kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode 
ipmi_si ipmi_msghandler acpi_cpufreq pcspkr i2c_i801 i2c_core lpc_ich mfd_core 
shpchp sg ioatdma ib_sa ib_mad ib_core ib_addr ipv6 vxlan ixgbe dca ptp 
pps_core hwmon mdio ext3 jbd mbcache sd_mod crc_t10dif crct10dif_common 
usb_storage ahci libahci mpt2sas scsi_transport_sas raid_class [last unloaded: 
mlx4_core]
CPU: 13 PID: 7212 Comm: rmmod Not tainted 3.16.0-rc1+ #1
Hardware name: Oracle Corporation SUN FIRE X4170 M3 /ASSY,MOTHERBOARD,1U   
, BIOS 17050100 08/29/2013
task: 880461540110 ti: 88046500 task.ti: 88046500
RIP: 0010:[a0350450]  [a0350450] 
__mlx4_remove_one+0x20/0x380 [mlx4_core]
RSP: 0018:880465003d88  EFLAGS: 00010296
RAX: 0001 RBX:  RCX: 
RDX: 0026 RSI: 0292 RDI: 880468b8f000
RBP: 880465003db8 R08:  R09: 
R10: 09f911029d74e35b R11: 09f911029d74e35b R12: 
R13: 880468b8f000 R14: a036de40 R15: 0001
FS:  7ff287fc2700() GS:88046fce() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 038c CR3: 00045cfae000 CR4: 000407e0
Stack:
 880465003da8 880468b8f000  880468b8f000
 a036de40 0001 880465003dd8 a0350805
 880468b8f098 a036dd60 880465003e08 812ebaa6
Call Trace:
 [a0350805] mlx4_remove_one+0x25/0x50 [mlx4_core]
 [812ebaa6] pci_device_remove+0x46/0xc0
 [813ce08f] __device_release_driver+0x7f/0xf0
 [813ce1c8] driver_detach+0xc8/0xd0
 [813cced9] bus_remove_driver+0x59/0xd0
 [813cef80] driver_unregister+0x30/0x70
 [812ebc13] pci_unregister_driver+0x23/0x80
 [a03650e4] mlx4_cleanup+0x10/0x1e [mlx4_core]
 [810ceff9] SyS_delete_module+0x189/0x210
 [815d2f12] system_call_fastpath+0x16/0x1b
Code: 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 
48 83 ec 08 66 66 66 66 90 48 8b 9f 58 01 00 00 49 89 fd 44 8b b3 8c 03 00 00 
45 85 f6 0f 85 41 02 00 00 f6 43 08 04 44 
RIP  [a0350450] __mlx4_remove_one+0x20/0x380 [mlx4_core]
 RSP 880465003d88
CR2: 038c
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfs-rdma performance

2014-06-13 Thread Shirley Ma

On 06/12/2014 04:06 PM, Mark Lehrer wrote:
 I am using ConnectX-3 HCA's and Dell R720 servers.
 
 On Thu, Jun 12, 2014 at 2:00 PM, Steve Wise sw...@opengridcomputing.com 
 wrote:
 On 6/12/2014 2:54 PM, Mark Lehrer wrote:

 Awesome work on nfs-rdma in the later kernels!  I had been having
 panic problems for awhile and now things appear to be quite reliable.

 Now that things are more reliable, I would like to help work on speed
 issues.  On this same hardware with SMB Direct and the standard
 storage review 8k 70/30 test, I get combined read  write performance
 of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
 This is simply an unacceptable difference.

I was able to get close to 2.5GB/s with ConnectX-2 for direct I/O. What's your 
test case and wsize/rsize? Did you collect /proc/interrupts, cpu usage and 
profiling data?


 I'm using the standard settings -- connected mode, 65520 byte MTU,
 nfs-server-side async, lots of nfsd's, and nfsver=3 with large
 buffers.  Does anyone have any tuning suggestions and/or places to
 start looking for bottlenecks?


 What RDMA device?

 Steve.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NFSoRDMA developers bi-weekly meeting minutes (6/11)

2014-06-13 Thread Shirley Ma
Attendees:

Jeff Becker (NASA)
Susan Coulter (LANL)
Rupert Dance (Soft Forge)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA development and 
test effort across different organizations, to speed up NFSoRDMA upstream kernel 
work and the development of NFSoRDMA diagnostic/debugging tools. Hopefully the 
quality of NFSoRDMA upstream patches can be improved by having them tested by a 
quorum of HW vendors.

Today's meeting notes:
1. Rupert gave an update on OFED 3.12-1
OFED 3.12-1 will support RedHat 7.0. It will include the Intel MPSS OFED packages 
and Intel Xeon support, the recent NFSoRDMA patches from Shirley and Chuck (which 
Jeff has backported), Steve Wise's Chelsio code, as well as Intel iWARP code. OFED 
3.12-1 will GA in 1-2 months. It will be backward compatible with RHEL 7.

2. Follow-up on last week's soft RoCE / soft iWARP discussion
Chuck Lever gave an update on Soft RoCE: it has seen no recent updates, has no 
resources for maintenance, and supports only userspace memory registration, so 
our kernel NFSoRDMA implementation is not supported on it.
Steve Wise gave an update on Soft iWARP: it has been used with our kernel 
NFSoRDMA implementation before, and it will be merged upstream soon.

3. Reviews of patches going into 3.16
Anna has checked in Chuck's 24 client patches; Steve Wise's refactoring patch for 
large I/O is also in the upstream merge tree.
Devesh has run cthon and IOzone tests on the 3.15 kernel and all the tests passed, 
with no issues found.
Good news: the most recent upstream NFSoRDMA is much more stable now!

4. Anna gave an update on the upstream merge process. Developers will test/validate 
the QA tree branch Steve is going to build against RC2 or later, based on Chuck's 
client tree and Steve's server tree. Anna will pull patches once they have been 
blessed by the developers. Trond/Bruce can then merge these patches upstream.

5. Jeff gave an update on merging NFSoRDMA patches into OFED 3.12-1. Jeff will 
manually pull Steve Wise's patches into the OFED tree.

6. Shirley is still having trouble getting a guest VF working for NFSoRDMA. There 
is no SR-IOV testing in the IOL lab. Rupert will pass the SR-IOV questions Shirley 
raised to Mellanox for help.

Meeting time: one hour discussion every other Wed (next meeting will be
on 6/25). A reminder will be sent out to both linux-nfs and linux-rdma
mailing lists:

6/25/2014
@8:00am PST
@9:00am MST
@10:00am CST
@11:00am EST
@Bangalore @8:00pm
@Israel @6:00pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
US: 8666824770,  408-7744073
Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the 
community to make NFSoRDMA better.

Shirley
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NFSoRDMA developers bi-weekly meeting minutes (5/28)

2014-05-29 Thread Shirley Ma
Attendees:

Jeff Becker (NASA)
Yan Burman (Mellanox)
Wendy Cheng (Intel)
Susan Coulter (LANL)
Rupert Dance (Soft Forge)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Devesh Sharma (Emulex)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA development and 
test effort across different organizations, to speed up NFSoRDMA upstream kernel 
work and the development of NFSoRDMA diagnostic/debugging tools. Hopefully the 
quality of NFSoRDMA upstream patches can be improved by having them tested by a 
quorum of HW vendors.

Today's meeting notes:
1. OFED release update from Rupert Dance through email:

a. OFED 3.12 was released yesterday without any patches from Chuck's git tree; one 
reason is that these patches haven't been merged upstream yet. There were a number 
of new bugs filed on NFSRDMA (2489 and 2490).
http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2489
http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2490

b. Jeff Becker has pulled all of the patches from Chuck's git tree and made 
backports in his local branch of OFED 3.12. He has begun testing and is seeing 
good results so far.

c. The next OFED release will be OFED 3.12-1, and it will include all these 
updates along with support for RHEL 7.0 and Intel's OFED MIC.

2. NFSoRDMA support with soft RoCE and Soft iWARP

There was some discussion regarding whether removing RPCRDMA_REGISTER support 
(one of Chuck's patchsets) would impact any other components in the stack. So 
far soft RoCE hasn't been upstreamed yet. There was a broad consensus not to 
support out-of-tree providers unless an issue also affects in-tree providers. 
NFSoRDMA will follow kernel development policy: all work should be based upon the 
upstream kernel. However, Chuck Lever will check the Soft RoCE plan and Steve Wise 
will check the soft iWARP plan to make sure nothing is broken in either the kernel 
or the OFED release.

3. Update on testing NFSoRDMA client patches:

Devesh Sharma, Doug Ledford, Chuck Lever and Steve Wise have all tested Chuck's git 
tree (up to last week's patchsets) on different platforms with various vendors' 
HCAs. The tests showed the stack is pretty reliable for both NFSv3 and NFSv4. 
However, NFSv4.1 hit a server crash (NFSv4.1 is not supported yet).

Steve Wise's test covers iWARP Chelsio
Devesh Sharma's test covers Emulex
Chuck Lever/Shirley Ma's test covers Mellanox
Doug Ledford's test covers various platforms and combination of HCAs 
(Interoperability test).
Jeff Becker's test covers the OFED release (backport)
Rupert Dance's (IOL) team test covers various platforms and combination of HCAs 
as well.

The test coverage should be good enough for NFSoRDMA client patchsets to be 
merged to upstream by Anna and Trond.

Devesh is looking for performance benchmark tools. IOzone is recommended. Anna 
is going to send more performance tools.

A place to save test results as scratch sheets was discussed, so it will be easy 
to track the test history for any regressions. Anna will help figure it out.

A couple of new bugs have been filed to track existing issues. Devesh hit a bug in 
a dbench test, which Steve Wise has already been working on.

https://bugzilla.linux-nfs.org/show_bug.cgi?id=255

Klemens Senn has reported a soft lockup when unloading kernel modules. Shirley 
tried to reproduce this problem with a Linux server and a Solaris client and didn't 
hit any issue, so it's a problem between the Linux client and server.

https://bugzilla.linux-nfs.org/show_bug.cgi?id=252


4. Steve shared his findings on a bug he has been working on -- the refactoring 
patchset.

5. Follow-ups from last week
a. Linux server maintenance is still unresolved.

b. NFSoRDMA debugging and diagnosis tools?
Yan has made some progress on the NFSoRDMA wireshark dissector. Selecting a 
connection is not as simple as with TCP; Yan has tried to use the QP number and 
RDMA connection establishment status to identify the connection. Chuck suggested 
trying the RPC XID field.

c. NFSoRDMA virtualization validation:
Shirley has set up a KVM guest with Mellanox ConnectX-2 SR-IOV. A panic occurred 
right away during mount; the panic is different from the one seen with the Xen 
domU guest.

Next meeting topics proposal:
1. Follow up on the work discussed in this meeting.

2. Walk through some of the stories on pivotal, link is as below:
https://www.pivotaltracker.com/s/projects/958376

3. Invite some of the developers to discuss some of their requirements and 
features.

Meeting time: one hour discussion every other Wed (next meeting will be
on 6/11). A reminder will be sent out to both linux-nfs and linux-rdma
mailing lists:

6/11/2014
@8:00am PST
@9:00am MST
@10:00am CST
@11:00am EST
@Bangalore @9:00pm
@Israel @6:00pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
US: 8666824770,  408-7744073
Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call

Re: Soft lockup in unloading kernel modules

2014-05-19 Thread Shirley Ma

Klemens,

Can you add more details on how you unload the modules (step by step) 
to the bug report?


Thanks
Shirley

On 05/19/2014 10:51 AM, Chuck Lever wrote:

Hi Klemens-

On May 13, 2014, at 12:48 PM, Klemens Senn klemens.s...@ims.co.at wrote:


Hi Anna,

today I retried unloading the kernel modules with your updated kernel
and additionally I tried the nfsd-next kernel from J. Bruce Fields and
Chuck's nfs-rdma-client kernel.

I filed

   https://bugzilla.linux-nfs.org/show_bug.cgi?id=252

to track this issue.



In short: None of these was able to unload the kernel modules with an
active connection.

In detail:

With your kernel I got following 3 faults:
  o BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:4615]
  o BUG: unable to handle kernel NULL pointer dereference at
0003
  o BUG: unable to handle kernel paging request at 5b8c

With the nfsd-next kernel I got following results:
  o BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:4452]
  o module unloading blocks forever, dmesg shows:
nfsd: last server has exited, flushing export cache
waiting module removal not supported: please upgrade
  o Kernel keeps running but reports the following:
nfsd: last server has exited, flushing export cache
waiting module removal not supported: please upgrade
svc_xprt_enqueue: threads and transports both waiting??
INFO: task modprobe:4510 blocked for more than 480 seconds.
  Not tainted 3.15.0-rc1-bfields-master+ #1
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
message.
modprobeD 88087fc13440 0  4510   4458 0x
 88105bb23c58 0086 88105c14e690 00013440
 88105bb23fd8 00013440 81a14480 88105c14e690
 0037 88085d7f74d8 88085d7f74e0 7fff
Call Trace:
 [815a2424] schedule+0x24/0x70
 [815a18cc] schedule_timeout+0x1ec/0x260
 [8159a504] ? printk+0x5c/0x5e
 [815a3406] wait_for_completion+0x96/0x100
 [81080c90] ? try_to_wake_up+0x2b0/0x2b0
 [a0314039] cma_remove_one+0x1a9/0x220 [rdma_cm]
 [a01fea86] ib_unregister_device+0x46/0x120 [ib_core]
 [a02c5dc9] mlx4_ib_remove+0x29/0x260 [mlx4_ib]
 [a04319d0] mlx4_remove_device+0xa0/0xc0 [mlx4_core]
 [a0431a2b] mlx4_unregister_interface+0x3b/0xa0 [mlx4_core]
 [a02d74cc] mlx4_ib_cleanup+0x10/0x23 [mlx4_ib]
 [810bd612] SyS_delete_module+0x152/0x220
 [81149684] ? vm_munmap+0x54/0x70
 [815ad5a6] system_call_fastpath+0x1a/0x1f

With the nfs-rdma-client I got following results:
  o module unloading blocks forever, dmesg shows:
nfsd: last server has exited, flushing export cache
svc_xprt_enqueue: threads and transports both waiting??
  o BUG: unable to handle kernel paging request at 4dec
IP: [815a63b5] _raw_spin_lock_bh+0x15/0x40
PGD 107ba9a067 PUD 105c093067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: nfsd nfs_acl auth_rpcgss oid_registry svcrdma
dm_mod cpuid nfs fscache lockd sunrpc af_packet 8021q garp stp llc
rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en
mlx4_ib(-) ib_sa ib_mad ib_core ib_addr sr_mod cdrom usb_storage joydev
mlx4_core usbhid x86_pkg_temp_thermal coretemp kvm_intel kvm
ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
glue_helper ehci_pci aes_x86_64 ehci_hcd isci iTCO_wdt libsas pcspkr
iTCO_vendor_support igb i2c_algo_bit sb_edac lpc_ich edac_core ioatdma
usbcore tpm_tis ptp microcode i2c_i801 sg mfd_core scsi_transport_sas
ipmi_si usb_common tpm wmi pps_core dca ipmi_msghandler acpi_cpufreq
button edd autofs4 xfs libcrc32c crc32c_intel processor thermal_sys
scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh
CPU: 14 PID: 4813 Comm: modprobe Not tainted
3.15.0-rc5-cel-nfs-rdma-client-unpatched+ #2
Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
task: 88085bf96190 ti: 88085d42a000 task.ti: 88085d42a000
RIP: 0010:[815a63b5]  [815a63b5]
_raw_spin_lock_bh+0x15/0x40
RSP: 0018:88085d42bd18  EFLAGS: 00010286
RAX: 0001 RBX: 4de8 RCX: 
RDX: 000b RSI: 000e RDI: 4dec
RBP: 88085d42bd18 R08: 88087c611f38 R09: a140
R10: 002b R11:  R12: 88085dcc3c00
R13: 88105ca13280 R14: 4dec R15: 4df0
FS:  7f0e49fb5700() GS:88107fcc()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 4dec CR3: 00105b027000 CR4: 000407e0
Stack:
 88085d42bd58 a03bd9f0 01328b88 88085dcc3c00
 88085dce8000 88105ca13280 88085dce8260 88085dce81c8
 88085d42bd78 a0441ce9 

NFSoRDMA developers bi-weekly meeting summary (5/14)

2014-05-14 Thread Shirley Ma

Attendees:
Allen Andrews (Emulex)
Jeff Becker (NASA)
Yan Burman (Mellanox)
Wendy Cheng (Intel)
Rupert Dance (Soft Forge)
Chuck Lever (Oracle)
Doug Ledford (RedHat)
Shirley Ma (Oracle)
Anna Schumaker (Net App)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA development 
and test effort across different organizations, to speed up NFSoRDMA upstream 
kernel work and the development of NFSoRDMA diagnostic/debugging tools. 
Hopefully the quality of NFSoRDMA upstream patches can be improved by having 
them tested by a quorum of HW vendors.


Today's meeting notes:
1. Flesh out the process for testing and reviewing upstream patch sets:

What kind of test cases to run?
The upstream patch sets are required to be run through, and pass, the functional 
test suites. In the long run, performance test suites will be required; they 
will be added later. Link to the NFSoRDMA test wiki:
http://wiki.linux-nfs.org/wiki/index.php/NfsRdmaClient/Home#Submitting_patches 



How would a linux-nfs git tree help NFSoRDMA patch sets get upstream into the 
mainline kernel?
Steve Wise will create a linux-nfs NFSoRDMA server git tree with help from 
Anna Schumaker. Chuck Lever's client branch git tree includes all the client 
patch sets to be tested. Anna Schumaker maintains all the client patches that 
have passed the test suites (sign-off: Tested-by) and are to be pulled 
upstream. Steve Wise will create a branch that includes both client and server 
patch sets, in case you use the same node for both client and server.


How are these tests to be done?
Each developer has his/her own hardware to test on before submitting the 
patch sets. IOL has a broad variety of hardware; Rupert Dance will 
check how they can help in addition to the OFED stack tests. The test build 
process will be defined later.


2. Identify a long term resource for maintaining the linux NFSoRDMA server

Oracle has allocated resources to support the NFSoRDMA client. The Linux 
community needs funding to support the NFSoRDMA server; there are engineering 
resources but not the money. Are any third parties interested in 
supporting the NFSoRDMA server?


3. Follow ups from last meeting:

NFSoRDMA debugging and diagnosis tools?
Yan has started to look at an NFSoRDMA dissector. Chelsio packets can be 
monitored by turning on switch port mirroring.
Chuck has created linux-nfs bugzilla components for NFSoRDMA: the server 
component is svcrdma, the client component is xprtrdma. Here is the link for 
you to check/submit bugs:

https://bugzilla.linux-nfs.org/

4. NFSoRDMA virtualization validation:
Yan and Shirley are working on validating ConnectX SR-IOV on both Xen and 
KVM. They hit a couple of issues which need further debugging.


Next meeting topics proposal:
1. Follow up on the work discussed in this meeting.

2. Walk through some of the stories on pivotal, link is as below:
https://www.pivotaltracker.com/s/projects/958376

3. Invite some of the developers to discuss some of their requirements 
and features.


Meeting time: one hour discussion every other Wed (next meeting will be 
on 5/28). A reminder will be sent out to both the linux-nfs and linux-rdma 
mailing lists:


5/28/2014
@8:00am PST
@9:00am MST
@10:00am CST
@11:00am EST
@Bangalore @9:00pm
@Israel @6:00pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
US: 8666824770,  408-7744073
Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work 
to the community to make NFSoRDMA better.


Shirley
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSoRDMA developers bi-weekly meeting announcement (4/30)

2014-04-30 Thread Shirley Ma

On 04/30/2014 01:00 PM, Or Gerlitz wrote:

On Wed, Apr 30, 2014 at 10:47 PM, Chuck Lever chuck.le...@oracle.com


If I understood Yan, he is trying to use NFS/RDMA in guests (kvm?).  We
are pretty sure that is not working at the moment,

can you provide a short 1-2 liner why/what is broken there? the only
thing which I can think of to be not-supported over mlx4 VFs is the
proprietary FMRs, but AFAIK, the nfs-rdma code doesn't even have a
mode which uses them, right?
I've created a Xen guest (DomU). The Dom0 PF works even without the mtts script 
enabled; however, on the DomU I hit this problem just by mounting the file system:

mlx4_core :00:04.0: Failed to allocate mtts for 66 pages(order 7)
mlx4_core :00:04.0: Failed to allocate mtts for 4096 pages(order 12)
mlx4_core :00:04.0: Failed to allocate mtts for 4096 pages(order 12)

RDMA microbenchmark perftest works ok. I enabled mtts scripts when 
booting the Xen guest. cat /proc/mtrr:


[root@ca-nfsdev1vm1 log]# cat /proc/mtrr
reg00: base=0x0f000 ( 3840MB), size=  128MB, count=1: uncachable
reg01: base=0x0f800 ( 3968MB), size=   64MB, count=1: uncachable

lspci -v
00:04.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 
Virtual Function] (rev b0)

Subsystem: Mellanox Technologies Device 61b0
Physical Slot: 4
Flags: bus master, fast devsel, latency 0
Memory at f000 (64-bit, prefetchable) [size=128M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core

I will need to find another machine to try KVM guest. Yan might hit a 
different problem.


I have ConnectX-2, FW level is 2.11.2012. Yan has ConnectX-3, he tried 
it on KVM guest.

but that is a priority
to get fixed. Shirley has a lab set up and has been looking into it.

Shirley
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSoRDMA developers bi-weekly meeting announcement (4/30)

2014-04-30 Thread Shirley Ma


On 04/30/2014 04:58 PM, Doug Ledford wrote:

On 04/30/2014 Shirley Ma wrote:

On 04/30/2014 01:00 PM, Or Gerlitz wrote:

On Wed, Apr 30, 2014 at 10:47 PM, Chuck Lever
chuck.le...@oracle.com


If I understood Yan, he is trying to use NFS/RDMA in guests
(kvm?).  We
are pretty sure that is not working at the moment,

can you provide a short 1-2 liner why/what is broken there? the
only
thing which I can think of to be not-supported over mlx4 VFs is the
proprietary FMRs, but AFAIK, the nfs-rdma code doesn't even have a
mode which uses them, right?

I've created Xen guest on DomU. Dom0 PF works which has no mtts been
enabled, however DomU I hit this problem by just mounting the file
system:
mlx4_core :00:04.0: Failed to allocate mtts for 66 pages(order 7)
mlx4_core :00:04.0: Failed to allocate mtts for 4096 pages(order
12)
mlx4_core :00:04.0: Failed to allocate mtts for 4096 pages(order
12)

RDMA microbenchmark perftest works ok. I enabled mtts scripts when
booting the Xen guest. cat /proc/mtrr:

What OS/RDMA stack are you using?  I'm not familiar with any mtts
scripts, however I know there is an mtrr fixup script I wrote for
the RDMA stack in Fedora/RHEL (and so I assume it's in Oracle Linux
too, but I haven't checked).  In fact, I assume that's the script
you are referring to, based on the fact that the next bit of your
email cats the /proc/mtrr file.  But I don't believe an mtrr setting
mixup, whether present or not, should have any impact on
the mtts allocations in the driver.  Even if your mtrr registers
were set incorrectly, the problem then becomes either A) a serious
performance bottleneck (in the case of Intel hardware that needs
write combining in order to get more than about 50MByte/s of
throughput on their cards) or B) failed operation because MMIO
writes to the card are being cached/write combined when they should
not be.

I suspect this is more likely Xen related than mtts/mtrr related.
Yes, that's the script I used. I wonder whether it's possible to disable 
mtrr on the DomU guest to debug this. I am new to Xen.

[root@ca-nfsdev1vm1 log]# cat /proc/mtrr
reg00: base=0x0f000 ( 3840MB), size=  128MB, count=1: uncachable
reg01: base=0x0f800 ( 3968MB), size=   64MB, count=1: uncachable

lspci -v
00:04.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2
Virtual Function] (rev b0)
  Subsystem: Mellanox Technologies Device 61b0
  Physical Slot: 4
  Flags: bus master, fast devsel, latency 0
  Memory at f000 (64-bit, prefetchable) [size=128M]
  Capabilities: [60] Express Endpoint, MSI 00
  Capabilities: [9c] MSI-X: Enable+ Count=4 Masked-
  Kernel driver in use: mlx4_core
  Kernel modules: mlx4_core

I will need to find another machine to try KVM guest. Yan might hit a
different problem.

I have ConnectX-2, FW level is 2.11.2012. Yan has ConnectX-3, he
tried
it on KVM guest.

but that is a priority
to get fixed. Shirley has a lab set up and has been looking into
it.

Shirley
--
To unsubscribe from this list: send the line unsubscribe linux-rdma
in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html